CDSnake: Snakemake pipeline for retrieval of annotated OTUs from paired-end reads using CD-HIT utilities.

Author information

Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199004.

Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia, 198515.

Publication information

BMC Bioinformatics. 2020 Jul 24;21(Suppl 12):303. doi: 10.1186/s12859-020-03591-6.

Abstract

BACKGROUND

Illumina paired-end reads are often used for 16S analysis in metagenomic studies. Because the DNA fragment is usually shorter than the combined length of the two paired reads, the reads overlap and can be merged for downstream analysis. Although several tools for merging paired-end reads have been developed, poor quality at the 3' ends within the overlapping region prevents accurate merging of a significant portion of read pairs. CD-HIT-OTU-Miseq was recently presented as a new approach to 16S analysis of paired-end reads; it avoids read merging entirely by clustering the paired reads separately. CD-HIT-OTU-Miseq is a set of tools meant to be launched in succession by auxiliary shell scripts. This launch mode is not suitable for processing the large amounts of data generated in modern omics experiments. To solve this issue we created CDSnake, a Snakemake pipeline that uses CD-HIT tools to simplify the consecutive launch of the CD-HIT-OTU-Miseq tools for complete processing of paired-end reads in metagenomic studies. The pipeline makes 16S analysis easier through a one-command launch and helps to yield reproducible results.
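
The overlap argument above comes down to simple arithmetic: a read pair overlaps whenever the two read lengths together exceed the fragment length. The following is a minimal sketch of that calculation, not part of CDSnake or CD-HIT; the function name `expected_overlap` and the fragment lengths used in the example are illustrative assumptions, not values from the paper.

```python
def expected_overlap(read_len_fwd: int, read_len_rev: int, fragment_len: int) -> int:
    """Estimate how many bases are covered by both reads of a pair.

    Hypothetical helper for illustration only. A positive result means the
    reads overlap and could in principle be merged; zero means they do not.
    """
    # A read cannot extend past the fragment it was sequenced from.
    fwd = min(read_len_fwd, fragment_len)
    rev = min(read_len_rev, fragment_len)
    # Bases counted twice when both reads are laid over the fragment.
    return max(0, fwd + rev - fragment_len)


# 2 x 250 bp reads on an assumed 400 bp fragment overlap by 100 bases.
overlap_short_fragment = expected_overlap(250, 250, 400)

# The same reads on an assumed 600 bp fragment do not overlap at all.
overlap_long_fragment = expected_overlap(250, 250, 600)
```

The longer the overlap, the more of it falls on the low-quality 3' ends of both reads, which is the situation in which merging tends to fail and separate clustering of the two reads becomes attractive.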

RESULTS

We benchmarked our pipeline against two commonly used pipelines for OTU retrieval, DADA2 and deblur, both incorporated into QIIME2, a popular workflow for microbiome analysis. Three mock datasets with highly overlapping paired-end 2 × 250 bp reads were used for benchmarking: Balanced, HMP, and Extreme. CDSnake output fewer OTUs than DADA2 and deblur. However, on the Balanced and HMP datasets, the number of OTUs output by CDSnake was closer to the real number of strains used to generate the mock community than the numbers output by DADA2 and deblur. Though generally slower than the other pipelines, CDSnake output higher total counts, preserving more information from the raw data. While inheriting these properties from the original CD-HIT-OTU-MiSeq utilities, CDSnake makes them handier to use thanks to simple scalability, easier automated runs, and other Snakemake benefits.

CONCLUSIONS

We developed a Snakemake pipeline for the CD-HIT-OTU-MiSeq utilities that simplifies and automates data analysis. Benchmarking showed that this approach can outperform popular tools under certain conditions.
