Comparative evaluation of gene set analysis approaches for RNA-Seq data.

Author information

Rahmatallah Yasir, Emmert-Streib Frank, Glazko Galina

Affiliations

Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.

Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, 97 Lisburn Road, Belfast, BT9 7BL, UK.

Publication information

BMC Bioinformatics. 2014 Dec 5;15(1):397. doi: 10.1186/s12859-014-0397-8.

DOI:10.1186/s12859-014-0397-8

PMID:25475910

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4265362/

Abstract

BACKGROUND

Over the last few years, transcriptome sequencing (RNA-Seq) has almost completely replaced microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes that are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in interpreting the results of RNA-Seq experiments, the limitations of GSA methods developed for microarrays in the context of RNA-Seq data are not well understood.

RESULTS

We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach utilizes univariate tests designed for the analysis of RNA-Seq data to find gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways that do not have any bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the methods for combining P-values, and all aforementioned biases are present in detected pathways.
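To make the gene-level approach concrete, the sketch below combines gene-specific P-values into a single pathway P-value with Fisher's method. This is only an illustration: the abstract does not say which combination techniques were evaluated, so Fisher's method is an assumption here, and the function name `fisher_combine` is hypothetical. Fisher's method also assumes the gene-level P-values are independent, which genes in the same pathway will often violate.

```python
import math

def fisher_combine(p_values):
    """Combine gene-level P-values into one pathway P-value (Fisher's method).

    Illustrative sketch only; assumes the input P-values are independent.
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")

    k = len(p_values)
    # Under the null, X = -2 * sum(ln p_i) follows a chi-square
    # distribution with 2k degrees of freedom. We work with X / 2.
    half_stat = -sum(math.log(p) for p in p_values)

    # Closed-form chi-square survival function for even d.f. 2k:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total

# Hypothetical gene-level P-values for one pathway.
pathway_p = fisher_combine([0.04, 0.20, 0.01, 0.65])
print(f"combined pathway P-value: {pathway_p:.4g}")
```

Because the combined statistic sums evidence over all genes in the set, the result depends on how many genes are combined and on how the individual P-values are distributed, so the choice of combination rule can change which pathways are called significant.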

CONCLUSIONS

Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways in RNA-Seq data and warn against applying gene-level GSA tests, especially because of their high Type I error rates for both simulated and real data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/177a/4265362/028f99a6e0dd/12859_2014_397_Fig1_HTML.jpg

Similar articles

Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline.

Brief Bioinform. 2016 May;17(3):393-407. doi: 10.1093/bib/bbv069. Epub 2015 Sep 4.

SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis.

BMC Bioinformatics. 2016 Feb 4;17:66. doi: 10.1186/s12859-016-0923-y.

Synthetic data sets for the identification of key ingredients for RNA-seq differential analysis.

Brief Bioinform. 2018 Jan 1;19(1):65-76. doi: 10.1093/bib/bbw092.

Statistical detection of differentially expressed genes based on RNA-seq: from biological to phylogenetic replicates.

Brief Bioinform. 2016 Mar;17(2):243-8. doi: 10.1093/bib/bbv035. Epub 2015 Jun 24.

Detecting Multivariate Gene Interactions in RNA-Seq Data Using Optimal Bayesian Classification.

IEEE/ACM Trans Comput Biol Bioinform. 2018 Mar-Apr;15(2):484-493. doi: 10.1109/TCBB.2015.2485223. Epub 2015 Oct 1.

ToPASeq: an R package for topology-based pathway analysis of microarray and RNA-Seq data.

BMC Bioinformatics. 2015 Oct 29;16:350. doi: 10.1186/s12859-015-0763-1.

Detection of high variability in gene expression from single-cell RNA-seq profiling.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):508. doi: 10.1186/s12864-016-2897-6.

rSeqNP: a non-parametric approach for detecting differential expression and splicing from RNA-Seq data.

Bioinformatics. 2015 Jul 1;31(13):2222-4. doi: 10.1093/bioinformatics/btv119. Epub 2015 Feb 24.

LFCseq: a nonparametric approach for differential expression analysis of RNA-seq data.

BMC Genomics. 2014;15 Suppl 10(Suppl 10):S7. doi: 10.1186/1471-2164-15-S10-S7. Epub 2014 Dec 12.

Cited by

Significant up-regulation of Toll-like receptor (TLR) signaling pathway in Epstein-Barr virus-associated gastric cancer.

Int J Mol Epidemiol Genet. 2025 Feb 25;16(1):1-8. doi: 10.62347/RIOX7768. eCollection 2025.

inhibits cancer cells growth by inducing G2/M arrest.

Front Pharmacol. 2023 Mar 17;14:1121799. doi: 10.3389/fphar.2023.1121799. eCollection 2023.

A Framework for Comparison and Assessment of Synthetic RNA-Seq Data.

Genes (Basel). 2022 Dec 14;13(12):2362. doi: 10.3390/genes13122362.

Statistical Approach of Gene Set Analysis with Quantitative Trait Loci for Crop Gene Expression Studies.

Entropy (Basel). 2021 Jul 23;23(8):945. doi: 10.3390/e23080945.

Data-driven detection of subtype-specific differentially expressed genes.

Sci Rep. 2021 Jan 11;11(1):332. doi: 10.1038/s41598-020-79704-1.

Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges.

Entropy (Basel). 2020 Apr 10;22(4):427. doi: 10.3390/e22040427.

Toward a gold standard for benchmarking gene set enrichment analysis.

Brief Bioinform. 2021 Jan 18;22(1):545-556. doi: 10.1093/bib/bbz158.

Proteome-transcriptome alignment of molecular portraits achieved by self-contained gene set analysis: Consensus colon cancer subtypes case study.

PLoS One. 2019 Aug 22;14(8):e0221444. doi: 10.1371/journal.pone.0221444. eCollection 2019.

Simultaneous Enrichment Analysis of all Possible Gene-sets: Unifying Self-Contained and Competitive Methods.

Brief Bioinform. 2020 Jul 15;21(4):1302-1312. doi: 10.1093/bib/bbz074.

Probabilistic prioritization of candidate pathway association with pathway score.

BMC Bioinformatics. 2018 Oct 24;19(1):391. doi: 10.1186/s12859-018-2411-z.
