Toward a gold standard for benchmarking gene set enrichment analysis.

Affiliations

Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA.

Institute for Implementation Science and Population Health, City University of New York, New York, NY 10027, USA.

Publication information

Brief Bioinform. 2021 Jan 18;22(1):545-556. doi: 10.1093/bib/bbz158.

DOI:10.1093/bib/bbz158

PMID:32026945

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7820859/

Abstract

MOTIVATION

Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets.

RESULTS

We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance.
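The "recovery of the predefined relevance rankings" criterion can be illustrated with a minimal sketch. It assumes, hypothetically, that a method's gene set ranking is scored by summing each set's precompiled relevance score weighted by its relative rank position, and that this is then expressed as a percentage of the best achievable score. The abstract does not spell out the scoring formula, so the weighting scheme and the function names below are illustrative assumptions, not the package's API.

```python
from typing import Dict, List


def relevance_score(ranked_sets: List[str], relevance: Dict[str, float]) -> float:
    """Sum relevance scores weighted by relative rank position.

    Assumed weighting (illustrative): the top-ranked set gets weight 1,
    and weights decrease linearly towards 1/n for the last-ranked set.
    Gene sets without a relevance entry contribute 0.
    """
    n = len(ranked_sets)
    if n == 0:
        return 0.0
    score = 0.0
    for i, gene_set in enumerate(ranked_sets):
        weight = 1.0 - i / n
        score += weight * relevance.get(gene_set, 0.0)
    return score


def optimal_score(ranked_sets: List[str], relevance: Dict[str, float]) -> float:
    """Best achievable score: the same sets ordered by decreasing relevance."""
    ideal = sorted(ranked_sets, key=lambda gs: relevance.get(gs, 0.0), reverse=True)
    return relevance_score(ideal, relevance)


def percent_of_optimum(ranked_sets: List[str], relevance: Dict[str, float]) -> float:
    """Observed score as a percentage of the optimal score."""
    opt = optimal_score(ranked_sets, relevance)
    if opt <= 0:
        return 0.0
    return 100.0 * relevance_score(ranked_sets, relevance) / opt


# Hypothetical disease relevance scores for three gene sets
relevance = {"A": 10.0, "B": 5.0, "C": 0.0}
good = percent_of_optimum(["A", "B", "C"], relevance)   # relevant sets ranked first
poor = percent_of_optimum(["C", "B", "A"], relevance)   # relevant sets ranked last
```

Normalizing by the optimum makes scores comparable across datasets whose relevance rankings differ in size and scale, which is why the sketch reports a percentage rather than the raw weighted sum.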

AVAILABILITY

http://bioconductor.org/packages/GSEABenchmarkeR.

CONTACT

ludwig.geistlinger@sph.cuny.edu.
