FastqPuri: high-performance preprocessing of RNA-seq data.

Affiliations

Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Am BioPark 9, Regensburg, 93053, Germany.

Department of Marine Microbiology and Biogeochemistry, NIOZ Royal Netherlands Institute for Sea Research and Utrecht University, P.O. Box 59, Den Burg, 1790 AB, The Netherlands.

Publication information

BMC Bioinformatics. 2019 May 3;20(1):226. doi: 10.1186/s12859-019-2799-0.

DOI:10.1186/s12859-019-2799-0

PMID:31053060

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6500068/

Abstract

BACKGROUND

RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. Now, the most time demanding step in the analysis of RNA-seq data is preprocessing the raw sequence data, such as running quality control and adapter, contamination and quality filtering before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software that covers all preprocessing steps is currently missing.

RESULTS

We here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable to be run within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness.
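To make the quality filtering step concrete, the following is a minimal Python sketch of reading uncompressed or gzip-compressed fastq data and keeping reads by mean base quality. It is an illustration only and not FastqPuri's implementation (which is written in C and R). The function names, the Phred+33 encoding, and the mean-quality threshold of 20 are assumptions made for this example.

```python
import gzip
from typing import Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def open_fastq(path: str):
    """Open a fastq file as text, detecting gzip by its two magic bytes."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_fastq(handle) -> Iterator[Record]:
    """Yield records from a stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 encoding (an assumption here)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def filter_by_quality(records, min_mean: float = 20.0) -> Iterator[Record]:
    """Keep reads whose mean Phred score is at least min_mean (hypothetical cutoff)."""
    for header, seq, qual in records:
        if mean_phred(qual) >= min_mean:
            yield header, seq, qual
```

A typical use would be `filter_by_quality(read_fastq(open_fastq(path)))`, which streams records one at a time so memory use stays independent of file size.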

CONCLUSIONS

FastqPuri is a new tool which covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data to meet the needs for fast preprocessing of fastq data to allow transcript and gene counting, but it is suitable to process any short read sequencing data of which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. FastqPuri is most flexible in filtering undesired biological sequences by offering two approaches to optimize speed and memory usage dependent on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri . It is implemented in C and R and licensed under GPL v3.
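The abstract does not name the two contamination-filtering approaches, so the sketch below is not FastqPuri's method. It only illustrates one generic way to screen reads against contaminating sequences: an exact k-mer lookup held in a Python set. The function names, the choice of k, and the 0.5 threshold are assumptions for illustration, and reverse-complement matches are not considered.

```python
from typing import Iterable, Set


def build_kmer_index(contaminants: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer that occurs in the contaminant sequences."""
    index: Set[str] = set()
    for seq in contaminants:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index.add(seq[i:i + k])
    return index


def contaminant_fraction(read: str, index: Set[str], k: int) -> float:
    """Fraction of the read's k-mers that are present in the index."""
    read = read.upper()
    n_kmers = len(read) - k + 1
    if n_kmers <= 0:  # read shorter than k: nothing to look up
        return 0.0
    hits = sum(read[i:i + k] in index for i in range(n_kmers))
    return hits / n_kmers


def is_contaminated(read: str, index: Set[str], k: int,
                    threshold: float = 0.5) -> bool:
    """Flag a read when at least `threshold` of its k-mers hit the index."""
    return contaminant_fraction(read, index, k) >= threshold
```

An exact set like this grows with the total size of the contaminating sequences, which is the kind of speed and memory trade-off the abstract refers to when it mentions choosing an approach by contaminant size.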

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ddf1/6500068/c2e83938c198/12859_2019_2799_Fig1_HTML.jpg
