The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
Litwin-Zucker Center for the study of Alzheimer's Disease, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA; Division of Geriatric Psychiatry, Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, USA.
Comput Biol Chem. 2021 Jun;92:107455. doi: 10.1016/j.compbiolchem.2021.107455. Epub 2021 Feb 12.
A standard pathway/gene-set enrichment analysis, the over-representation analysis, is based on four values: the sizes of the two gene-sets, the size of their overlap, and the size of the gene universe from which the gene-sets are chosen. The standard result of such an analysis is based on the p-value of a statistical test. We supplement this standard pipeline with six cautions: (1) any p-value threshold used to distinguish enriched gene-sets from non-enriched ones is to a certain degree arbitrary; (2) genes in a gene-set may be correlated, which potentially overcounts the gene-set size; (3) any attempt to impose multiple testing correction will increase the false negative rate; (4) gene-sets in a gene-set database may be correlated, potentially overcounting the factor for multiple testing correction; (5) the discrete nature of the data makes it possible that a minimal change in counts may lead to a quantum change in the p-value threshold-based conclusion; (6) the two gene-sets may not be chosen from the universe of all human genes, but in fact from a subset of that universe, or even from two different subsets of all genes. Careful reconsideration of these issues can have an impact on the conclusion of an enrichment analysis. Some of our cautions mirror the call from statisticians that reaching a conclusion from data is not a simple matter of a p-value being smaller than 0.05, but a thoughtful process carried out with due diligence.
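To make the four-value setup concrete, the following is a minimal sketch of an over-representation test, assuming the statistical test is the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), which the abstract does not name explicitly. The function name `ora_p_value` and all counts are hypothetical and chosen only for illustration. The loops show how a change of one gene in the overlap (caution 5) or a different choice of universe (caution 6) moves the p-value.

```python
from math import comb


def ora_p_value(universe: int, set_a: int, set_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe: number of genes both gene-sets are drawn from (assumed)
    set_a, set_b: sizes of the two gene-sets
    overlap: number of genes shared by the two gene-sets
    """
    max_overlap = min(set_a, set_b)
    # Number of ways to draw set_b genes from the universe.
    total = comb(universe, set_b)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: a 200-gene pathway and a 100-gene hit list.
for k in (3, 4, 5):
    # Caution 5: one extra overlapping gene shifts the p-value in a discrete step.
    print(k, ora_p_value(20000, 200, 100, k))

for n in (20000, 10000, 5000):
    # Caution 6: the same counts give different p-values in a smaller universe.
    print(n, ora_p_value(n, 200, 100, 4))
```

Exact integer arithmetic via `math.comb` keeps the sketch free of third-party dependencies; for large universes a library routine such as a hypergeometric survival function would usually be faster.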