Beyond standard pipeline and p < 0.05 in pathway enrichment analyses.

Author information

The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.

Litwin-Zucker Center for the study of Alzheimer's Disease, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA; Division of Geriatric Psychiatry, Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, USA.

Publication information

Comput Biol Chem. 2021 Jun;92:107455. doi: 10.1016/j.compbiolchem.2021.107455. Epub 2021 Feb 12.

DOI:10.1016/j.compbiolchem.2021.107455

PMID:33774420

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC9179938/

Abstract

A standard pathway/gene-set enrichment analysis, the over-representation analysis, is based on four values: the sizes of two gene-sets, the size of their overlap, and the size of the gene universe from which the gene-sets are chosen. The standard result of such an analysis is based on the p-value of a statistical test. We supplement this standard pipeline with six cautions: (1) any p-value threshold used to distinguish enriched gene-sets from non-enriched ones is to a certain degree arbitrary; (2) genes in a gene-set may be correlated, so the gene-set size may be overcounted; (3) any attempt to impose multiple testing correction will increase the false negative rate; (4) gene-sets in a gene-set database may be correlated, so the factor used for multiple testing correction may be overcounted; (5) the discrete nature of the data makes it possible for a minimal change in counts to cause an abrupt change in the p-value threshold-based conclusion; (6) the two gene-sets may not be chosen from the universe of all human genes, but from a subset of that universe, or even from two different subsets of all genes. Careful reconsideration of these issues can change the conclusion of an enrichment analysis. Some of our cautions mirror the call from statisticians that reaching a conclusion from data is not a simple matter of a p-value smaller than 0.05, but a thoughtful process requiring due diligence.
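
As a rough illustration of the four-value test described in the abstract, the sketch below computes an over-representation p-value as the upper tail of a hypergeometric distribution, which is one common choice and not necessarily the exact test the authors used. The function name `ora_pvalue` and all counts (a 100-gene pathway, a 200-gene query list, universes of 20,000 and 5,000 genes, 1,000 tested gene-sets) are hypothetical values chosen only to show cautions (3), (5), and (6) in action.

```python
from math import comb


def ora_pvalue(n_a: int, n_b: int, overlap: int, universe: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    n_a, n_b : sizes of the two gene-sets (hypothetical names)
    overlap  : number of genes shared by the two sets
    universe : number of genes both sets are assumed to be drawn from
    """
    if not (0 <= overlap <= min(n_a, n_b) and max(n_a, n_b) <= universe):
        raise ValueError("inconsistent counts")
    total = comb(universe, n_b)  # all ways to draw the second set
    tail = sum(
        comb(n_a, k) * comb(universe - n_a, n_b - k)
        for k in range(overlap, min(n_a, n_b) + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
PATHWAY, QUERY, UNIVERSE = 100, 200, 20_000

# Caution (5): one extra overlapping gene can flip a threshold-based call.
p_overlap_3 = ora_pvalue(PATHWAY, QUERY, 3, UNIVERSE)
p_overlap_4 = ora_pvalue(PATHWAY, QUERY, 4, UNIVERSE)

# Caution (6): the same counts against a smaller assumed universe.
p_small_universe = ora_pvalue(PATHWAY, QUERY, 4, 5_000)

# Caution (3): a Bonferroni threshold for 1,000 tested gene-sets (assumed).
bonferroni_threshold = 0.05 / 1_000

if __name__ == "__main__":
    print(f"overlap=3, universe=20000: p={p_overlap_3:.4g}")
    print(f"overlap=4, universe=20000: p={p_overlap_4:.4g}")
    print(f"overlap=4, universe=5000:  p={p_small_universe:.4g}")
    print(f"passes Bonferroni: {p_overlap_4 < bonferroni_threshold}")
```

With these assumed counts, an overlap of 3 falls above 0.05 and an overlap of 4 falls below it, the overlap of 4 is no longer nominally significant once the universe is taken to be 5,000 genes, and it does not survive the Bonferroni threshold. Exact integer arithmetic via `math.comb` is used to avoid a dependency on a statistics library; for large analyses a library routine would likely be faster.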
