Assessment of Gene Set Enrichment Analysis using curated RNA-seq-based benchmarks.
Author information
Longitudinal Studies Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Baltimore, MD, United States of America.
Publication information
PLoS One. 2024 May 16;19(5):e0302696. doi: 10.1371/journal.pone.0302696. eCollection 2024.
Pathway enrichment analysis is a ubiquitous computational biology method to interpret a list of genes (typically derived from the association of large-scale omics data with phenotypes of interest) in terms of higher-level, predefined gene sets that share biological function, chromosomal location, or other common features. Among many tools developed so far, Gene Set Enrichment Analysis (GSEA) stands out as one of the pioneering and most widely used methods. Although originally developed for microarray data, GSEA is nowadays extensively utilized for RNA-seq data analysis. Here, we quantitatively assessed the performance of a variety of GSEA modalities and provide guidance in the practical use of GSEA in RNA-seq experiments. We leveraged harmonized RNA-seq datasets available from The Cancer Genome Atlas (TCGA) in combination with large, curated pathway collections from the Molecular Signatures Database to obtain cancer-type-specific target pathway lists across multiple cancer types. We carried out a detailed analysis of GSEA performance using both gene-set and phenotype permutations combined with four different choices for the Kolmogorov-Smirnov enrichment statistic. Based on our benchmarks, we conclude that the classic/unweighted gene-set permutation approach offered comparable or better sensitivity-vs-specificity tradeoffs across cancer types compared with other, more complex and computationally intensive permutation methods. Finally, we analyzed other large cohorts for thyroid cancer and hepatocellular carcinoma. We utilized a new consensus metric, the Enrichment Evidence Score (EES), which showed a remarkable agreement between pathways identified in TCGA and those from other sources, despite differences in cancer etiology. This finding suggests an EES-based strategy to identify a core set of pathways that may be complemented by an expanded set of pathways for downstream exploratory analysis. 
This work fills the existing gap in current guidelines and benchmarks for the use of GSEA with RNA-seq data and provides a framework to enable detailed benchmarking of other RNA-seq-based pathway analysis tools.
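To make the compared modalities concrete, the following is a minimal sketch of the Kolmogorov-Smirnov-like running-sum enrichment score and a gene-set permutation null. It is an illustration under stated assumptions, not the authors' pipeline or the GSEA software itself: the function names, the toy data, and the simplified same-sign p-value are hypothetical. The abstract does not list the four weighting choices, so the exponent `p` is left as a parameter, with `p=0` corresponding to the classic/unweighted statistic.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, p=0.0):
    """Signed maximum deviation from zero of the running sum.

    ranked_genes: gene IDs sorted by decreasing ranking metric
    scores:       ranking metric for each gene, in the same order
    gene_set:     set of gene IDs to test
    p:            weight exponent (p=0 is the classic/unweighted case)
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    # Each hit contributes |score|^p; misses contribute a constant step down.
    weights = [abs(s) ** p if h else 0.0 for s, h in zip(scores, hits)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all hit weights are zero; cannot normalise")

    running, best = 0.0, 0.0
    for w, h in zip(weights, hits):
        running += w / total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_set_permutation_pvalue(ranked_genes, scores, gene_set,
                                p=0.0, n_perm=1000, seed=0):
    """Nominal p-value from random gene sets of the same size.

    Simplified assumption: the null is restricted to permutations whose
    score has the same sign as the observed one.
    """
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set, p)
    size = sum(g in gene_set for g in ranked_genes)

    same_sign = 0
    at_least_as_extreme = 0
    for _ in range(n_perm):
        null_set = set(rng.sample(ranked_genes, size))
        null_es = enrichment_score(ranked_genes, scores, null_set, p)
        if (null_es >= 0) == (observed >= 0):
            same_sign += 1
            if abs(null_es) >= abs(observed):
                at_least_as_extreme += 1

    pvalue = at_least_as_extreme / same_sign if same_sign else float("nan")
    return observed, pvalue


# Toy ranked list (hypothetical data): 20 genes with decreasing scores.
genes = [f"g{i}" for i in range(20)]
metric = [10.0 - i for i in range(20)]
top_set = {"g0", "g1", "g2"}

es_classic = enrichment_score(genes, metric, top_set, p=0.0)
es_weighted = enrichment_score(genes, metric, top_set, p=1.0)
es_obs, p_nominal = gene_set_permutation_pvalue(genes, metric, top_set,
                                                p=0.0, n_perm=200, seed=1)
```

Gene-set permutation only reshuffles gene labels against a fixed ranking, so it needs a single pass of differential expression. Phenotype permutation would instead recompute the ranking metric for every shuffle of the sample labels, which is where the additional computational cost mentioned in the abstract comes from.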