CDSnake: Snakemake pipeline for retrieval of annotated OTUs from paired-end reads using CD-HIT utilities.

Affiliations

Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199004.

Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia, 198515.

Publication information

BMC Bioinformatics. 2020 Jul 24;21(Suppl 12):303. doi: 10.1186/s12859-020-03591-6.

DOI:10.1186/s12859-020-03591-6

PMID:32703166

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7379820/

Abstract

BACKGROUND

Illumina paired-end reads are often used for 16S analysis in metagenomic studies. Since the DNA fragment is usually shorter than the combined length of the paired reads, the reads can be merged for downstream analysis. Although several tools for merging paired-end reads have been developed, poor quality at the 3' ends within the overlapping region prevents accurate merging of a significant portion of read pairs. Recently, CD-HIT-OTU-MiSeq was presented as a new approach to 16S analysis using paired-end reads; it avoids read merging entirely by clustering the paired reads separately. CD-HIT-OTU-MiSeq is a set of tools intended to be launched successively by auxiliary shell scripts. This launch mode is not suitable for processing the large amounts of data generated in modern omics experiments. To solve this issue, we created CDSnake, a Snakemake pipeline that uses CD-HIT tools to simplify the consecutive launch of the CD-HIT-OTU-MiSeq tools for complete processing of paired-end reads in metagenomic studies. The pipeline makes 16S analysis easier through a one-command launch and helps yield reproducible results.
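The overlap argument in the background can be made concrete with a little arithmetic. The following is a minimal sketch, not part of CDSnake or CD-HIT: the function name `expected_overlap` is hypothetical, and it assumes untrimmed reads sequenced inward from both ends of the fragment.

```python
def expected_overlap(read_len_fwd: int, read_len_rev: int, fragment_len: int) -> int:
    """Return how many bases of a fragment are covered by both reads of a pair.

    Illustrative assumption: reads are untrimmed and start at opposite
    ends of the fragment, so a read can never extend past the fragment.
    """
    if min(read_len_fwd, read_len_rev, fragment_len) <= 0:
        raise ValueError("all lengths must be positive")
    # Clip each read to the fragment length.
    fwd = min(read_len_fwd, fragment_len)
    rev = min(read_len_rev, fragment_len)
    # The reads overlap only if together they are longer than the fragment.
    return max(0, fwd + rev - fragment_len)


if __name__ == "__main__":
    # 2 x 250 bp reads on an illustrative 300 bp fragment.
    print(expected_overlap(250, 250, 300))
```

With 2 × 250 bp reads, any fragment shorter than 500 bp leaves an overlapping region, which is where the low-quality 3' ends of both reads fall.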

RESULTS

We benchmarked our pipeline against two commonly used pipelines for OTU retrieval that are incorporated into QIIME2, a popular workflow for microbiome analysis: DADA2 and deblur. Three mock datasets with highly overlapping paired-end 2 × 250 bp reads were used for benchmarking: Balanced, HMP, and Extreme. CDSnake output fewer OTUs than DADA2 and deblur. However, on the Balanced and HMP datasets, the number of OTUs output by CDSnake was closer to the real number of strains used to generate the mock community than the numbers output by DADA2 and deblur. Though generally slower than the other pipelines, CDSnake output higher total counts, preserving more information from the raw data. CDSnake inherits these properties from the original CD-HIT-OTU-MiSeq utilities and makes them handier to use through simple scalability, easier automated runs, and other Snakemake benefits.

CONCLUSIONS

We developed a Snakemake pipeline for the OTU-MiSeq utilities that simplifies and automates data analysis. Benchmarking showed that this approach can outperform popular tools under certain conditions.
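The automation described here rests on a general idea behind workflow engines: each step declares what it depends on, and the engine derives the launch order. The sketch below illustrates that idea with Python's standard `graphlib` module. It is not Snakemake's API and not the actual CDSnake rules; the step names in `STEPS` are hypothetical placeholders.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each step maps to the set of steps whose
# outputs it needs; these are NOT the real CDSnake rule names.
STEPS = {
    "trim_reads": set(),
    "cluster_pairs": {"trim_reads"},
    "annotate_otus": {"cluster_pairs"},
    "otu_table": {"cluster_pairs", "annotate_otus"},
}


def execution_order(steps):
    """Return the steps in an order that satisfies every dependency."""
    return list(TopologicalSorter(steps).static_order())


def run(steps, actions):
    """Call each step's action in dependency order and collect the results."""
    return [actions[name]() for name in execution_order(steps)]


if __name__ == "__main__":
    # Stand-in actions that only report which step ran.
    actions = {name: (lambda n=name: f"ran {n}") for name in STEPS}
    for line in run(STEPS, actions):
        print(line)
```

Declaring dependencies rather than hard-coding a launch sequence is what lets a single command process the whole dataset. A real workflow engine additionally tracks files and skips steps whose outputs are already up to date.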

Similar articles

Correction to: CDSnake: Snakemake pipeline for retrieval of annotated OTUs from paired-end reads using CD-HIT utilities.

BMC Bioinformatics. 2020 Aug 19;21(1):362. doi: 10.1186/s12859-020-03709-w.

From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data.

Gigascience. 2017 Feb 1;6(2):1-10. doi: 10.1093/gigascience/giw017.

Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence tags.

Gigascience. 2018 Mar 1;7(3):1-7. doi: 10.1093/gigascience/gix129.

Concatenation of paired-end reads improves taxonomic classification of amplicons for profiling microbial communities.

BMC Bioinformatics. 2021 Oct 12;22(1):493. doi: 10.1186/s12859-021-04410-2.

Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering.

Microbiome. 2015 Oct 5;3:43. doi: 10.1186/s40168-015-0105-6.

Joining Illumina paired-end reads for classifying phylogenetic marker sequences.

BMC Bioinformatics. 2020 Mar 14;21(1):105. doi: 10.1186/s12859-020-3445-6.

A comparison of sequencing platforms and bioinformatics pipelines for compositional analysis of the gut microbiome.

BMC Microbiol. 2017 Sep 13;17(1):194. doi: 10.1186/s12866-017-1101-8.

LotuS: an efficient and user-friendly OTU processing pipeline.

Microbiome. 2014 Sep 30;2(1):30. doi: 10.1186/2049-2618-2-30.

Piphillin predicts metagenomic composition and dynamics from DADA2-corrected 16S rDNA sequences.

BMC Genomics. 2020 Jan 17;21(1):56. doi: 10.1186/s12864-019-6427-1.

Cited by

An exploratory study on the metagenomic and proteomic characterization of hypothyroidism in the first half of pregnancy and correlation with Th1/Th2 balance.

Front Immunol. 2025 May 15;16:1500866. doi: 10.3389/fimmu.2025.1500866. eCollection 2025.

Effect of microbial diversity and their functions on soil nutrient cycling in the rhizosphere zone of Dahongpao mother tree and cutting Dahongpao.

Front Plant Sci. 2025 May 8;16:1574020. doi: 10.3389/fpls.2025.1574020. eCollection 2025.

An annotated near-complete sequence assembly of the Magnaporthe oryzae 70-15 reference genome.

Sci Data. 2025 May 7;12(1):758. doi: 10.1038/s41597-025-05116-3.

Aerospace Mutagenized Tea Tree Increases Rhizospheric Microorganisms, Enhances Nutrient Conversion Capacity and Promotes Growth.

Plants (Basel). 2025 Mar 21;14(7):981. doi: 10.3390/plants14070981.

Multi-Omics Profiling of Lipid Variation and Regulatory Mechanisms in Poultry Breast Muscles.

Animals (Basel). 2025 Feb 27;15(5):694. doi: 10.3390/ani15050694.

Gut microbiome characteristics of women with hypothyroidism during early pregnancy detected by 16S rRNA amplicon sequencing and shotgun metagenomic.

Front Cell Infect Microbiol. 2024 Aug 9;14:1369192. doi: 10.3389/fcimb.2024.1369192. eCollection 2024.

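Correction to: CDSnake: Snakemake pipeline for retrieval of annotated OTUs from paired-end reads using CD-HIT utilities.

BMC Bioinformatics. 2020 Aug 19;21(1):362. doi: 10.1186/s12859-020-03709-w.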

References

Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns.

mSystems. 2017 Mar 7;2(2). doi: 10.1128/mSystems.00191-16. eCollection 2017 Mar-Apr.

VSEARCH: a versatile open source tool for metagenomics.

PeerJ. 2016 Oct 18;4:e2584. doi: 10.7717/peerj.2584. eCollection 2016.

DADA2: High-resolution sample inference from Illumina amplicon data.

Nat Methods. 2016 Jul;13(7):581-3. doi: 10.1038/nmeth.3869. Epub 2016 May 23.

Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform.

Nucleic Acids Res. 2015 Mar 31;43(6):e37. doi: 10.1093/nar/gku1341. Epub 2015 Jan 13.

Trimmomatic: a flexible trimmer for Illumina sequence data.

Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1.

PEAR: a fast and accurate Illumina Paired-End reAd mergeR.

Bioinformatics. 2014 Mar 1;30(5):614-20. doi: 10.1093/bioinformatics/btt593. Epub 2013 Oct 18.

Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform.

Appl Environ Microbiol. 2013 Sep;79(17):5112-20. doi: 10.1128/AEM.01043-13. Epub 2013 Jun 21.

The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.

Nucleic Acids Res. 2013 Jan;41(Database issue):D590-6. doi: 10.1093/nar/gks1219. Epub 2012 Nov 28.

Snakemake--a scalable bioinformatics workflow engine.

Bioinformatics. 2012 Oct 1;28(19):2520-2. doi: 10.1093/bioinformatics/bts480. Epub 2012 Aug 20.

FLASH: fast length adjustment of short reads to improve genome assemblies.

Bioinformatics. 2011 Nov 1;27(21):2957-63. doi: 10.1093/bioinformatics/btr507. Epub 2011 Sep 7.
