How tool combinations in different pipeline versions affect the outcome in RNA-seq analysis.

Author information

Perelo Louisa Wessels, Gabernet Gisela, Straub Daniel, Nahnsen Sven

Affiliations

Quantitative Biology Center (QBiC), University of Tübingen, Otfried-Müller-Str. 37, 72076 Tübingen, Baden-Württemberg, Germany.

M3 Research Center, Faculty of Medicine, University of Tübingen, Otfried-Müller-Str. 37, 72076 Tübingen, Baden-Württemberg, Germany.

Publication information

NAR Genom Bioinform. 2024 Mar 7;6(1):lqae020. doi: 10.1093/nargab/lqae020. eCollection 2024 Mar.

DOI:10.1093/nargab/lqae020

PMID:38456178

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10919883/

Abstract

Data analysis tools are continuously changed and improved over time. To test how these changes influence the comparability between analyses, the outputs of different workflow options of the nf-core/rnaseq pipeline were compared. Five pipeline settings (STAR+Salmon, STAR+RSEM, STAR+featureCounts, HISAT2+featureCounts, and the pseudoaligner Salmon) were run on three datasets (human, Arabidopsis, zebrafish) containing spike-ins of the External RNA Control Consortium (ERCC). Fold change ratios and differential expression of genes and spike-ins were used to compare the different tool and version settings of the pipeline. Differential gene classification overlapped by 85% between pipelines. Genes interpreted with a bias were mostly those present at lower concentrations. The numbers of isoforms and exons per gene were also determinants. Previous pipeline versions using featureCounts showed a higher sensitivity for detecting single-isoform genes such as the ERCC spike-ins. To ensure data comparability in long-term analysis series, it is advisable either to stay with the pipeline version the series was initialized with or to run both versions during a transition period, so that the target genes are addressed in the same way.
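As an illustration of what an overlap in differential gene classification can mean, the following minimal Python sketch labels each gene as up, down, or not significant in every pipeline and reports the fraction of shared genes that receive the same label everywhere. The function names, the cutoffs (|log2 fold change| >= 1, adjusted p-value < 0.05), and the toy values are assumptions made for this example; the abstract does not state how the 85% figure was computed.

```python
def classify(log2fc, padj, lfc_cutoff=1.0, alpha=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant).

    The cutoffs are illustrative assumptions, not values from the paper.
    """
    if padj is None or padj >= alpha or abs(log2fc) < lfc_cutoff:
        return "ns"
    return "up" if log2fc > 0 else "down"


def classification_overlap(results):
    """Fraction of shared genes that get the same label in every pipeline.

    results maps pipeline name -> {gene_id: (log2fc, padj)}.
    Genes missing from any pipeline are ignored.
    """
    shared = set.intersection(*(set(r) for r in results.values()))
    if not shared:
        return 0.0
    agree = sum(
        1
        for gene in shared
        if len({classify(*r[gene]) for r in results.values()}) == 1
    )
    return agree / len(shared)


# Toy input with made-up numbers, only to show the expected data shape.
results = {
    "star_salmon": {
        "geneA": (2.1, 0.001),
        "geneB": (-1.5, 0.01),
        "geneC": (0.2, 0.8),
        "ERCC-00002": (1.2, 0.04),
    },
    "hisat2_featurecounts": {
        "geneA": (1.9, 0.002),
        "geneB": (-1.4, 0.02),
        "geneC": (0.1, 0.9),
        "ERCC-00002": (0.8, 0.06),
    },
}

overlap = classification_overlap(results)
print(f"{overlap:.2f}")  # prints 0.75: three of four genes agree
```

In this toy input the spike-in sits close to both cutoffs and flips between "up" and "ns", which mirrors the kind of borderline gene that can be classified differently by two pipeline settings.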
