Bioinformatics Research Group, SRI International, 333 Ravenswood Drive, Menlo Park, 94025, CA, USA.
Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, 140 Gortner Lab, 1479 Gortner Ave, Saint Paul, 55198, MN, USA.
BMC Genomics. 2021 Mar 16;22(1):191. doi: 10.1186/s12864-021-07502-8.
Enrichment or over-representation analysis is a common method used in bioinformatics studies of transcriptomics, metabolomics, and microbiome datasets. The key idea behind enrichment analysis is: given a set of significantly expressed genes (or metabolites), use that set to infer a smaller set of perturbed biological pathways or processes, in which those genes (or metabolites) play a role. Enrichment computations rely on collections of defined biological pathways and/or processes, which are usually drawn from pathway databases. Although practitioners of enrichment analysis take great care to employ statistical corrections (e.g., for multiple testing), they appear unaware that enrichment results are quite sensitive to the pathway definitions that the calculation uses.
We show that alternative pathway definitions can alter enrichment p-values by up to nine orders of magnitude, whereas statistical corrections typically alter enrichment p-values by only two orders of magnitude. We present multiple examples where the smaller pathway definitions used in the EcoCyc database produces stronger enrichment p-values than the much larger pathway definitions used in the KEGG database; we demonstrate that to attain a given enrichment p-value, KEGG-based enrichment analyses require 1.3-2.0 times as many significantly expressed genes as does EcoCyc-based enrichment analyses. The large pathways in KEGG are problematic for another reason: they blur together multiple (as many as 21) biological processes. When such a KEGG pathway receives a high enrichment p-value, which of its component processes is perturbed is unclear, and thus the biological conclusions drawn from enrichment of large pathways are also in question.
The choice of pathway database used in enrichment analyses can have a much stronger effect on the enrichment results than the statistical corrections used in these analyses.
富集或过表达分析是转录组学、代谢组学和微生物组学数据的生物信息学研究中常用的方法。富集分析的关键思想是:给定一组显著表达的基因(或代谢物),利用该组推断出较小的一组受干扰的生物途径或过程,其中这些基因(或代谢物)发挥作用。富集计算依赖于定义明确的生物途径和/或过程的集合,这些集合通常来自途径数据库。尽管富集分析的实践者非常注意采用统计校正(例如,用于多重检验),但他们似乎没有意识到富集结果对计算中使用的途径定义非常敏感。
我们表明,替代途径定义可以将富集 p 值改变多达九个数量级,而统计校正通常仅将富集 p 值改变两个数量级。我们提出了多个示例,其中 EcoCyc 数据库中使用的较小途径定义产生的富集 p 值比 KEGG 数据库中使用的大得多的途径定义要强;我们证明,为了达到给定的富集 p 值,基于 KEGG 的富集分析需要比基于 EcoCyc 的富集分析多 1.3-2.0 倍的显著表达基因。KEGG 中的大途径还有另一个问题:它们将多个(多达 21 个)生物过程混合在一起。当这样的 KEGG 途径获得高富集 p 值时,不清楚其组成过程中的哪一个受到干扰,因此,从大途径的富集中得出的生物学结论也存在疑问。
在富集分析中使用的途径数据库的选择对富集结果的影响可能比这些分析中使用的统计校正大得多。