Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA.
Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
Nucleic Acids Res. 2017 Feb 28;45(4):e20. doi: 10.1093/nar/gkw957.
Gene set analysis, which translates gene lists into enriched functions, is among the most common bioinformatic methods. Yet few would advocate taking the results at face value. Not only is there no agreement on the algorithms themselves, there is no agreement on how to benchmark them. In this paper, we evaluate the robustness and uniqueness of enrichment results as a means of assessing methods even where correctness is unknown. We show that heavily annotated (‘multifunctional’) genes are likely to appear in genomics study results and drive the generation of biologically non-specific enrichment results as well as highly fragile significances. By providing a means of determining where enrichment analyses report non-specific and non-robust findings, we are able to assess where we can be confident in their use. We find significant progress in recent bias correction methods for enrichment and provide our own software implementation. Our approach can be readily adapted to any pre-existing package.
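To make the fragility argument concrete, here is a minimal sketch of one way such a check could look. It is not the authors' software: it assumes a plain over-representation test (upper-tail hypergeometric), uses the number of gene sets annotating a gene as a crude stand-in for multifunctionality, and runs on invented gene and set names (`g1` to `g20`, sets `A`, `B`, `C`). The function names are likewise hypothetical.

```python
from math import comb


def enrichment_pvalue(hit_list, gene_set, universe):
    """Upper-tail hypergeometric p-value: P(overlap >= observed overlap)."""
    universe = set(universe)
    hits = set(hit_list) & universe
    members = set(gene_set) & universe
    N, K, n = len(universe), len(members), len(hits)
    k = len(hits & members)
    # Sum the probabilities of every overlap at least as large as the observed one.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def annotation_counts(gene_sets):
    """Number of gene sets annotating each gene (assumed multifunctionality proxy)."""
    counts = {}
    for members in gene_sets.values():
        for gene in members:
            counts[gene] = counts.get(gene, 0) + 1
    return counts


def fragility(hit_list, gene_sets, universe, set_name):
    """p-value for one set before and after dropping the most-annotated hit gene."""
    counts = annotation_counts(gene_sets)
    top = max(hit_list, key=lambda g: counts.get(g, 0))
    before = enrichment_pvalue(hit_list, gene_sets[set_name], universe)
    reduced = [g for g in hit_list if g != top]
    after = enrichment_pvalue(reduced, gene_sets[set_name], universe)
    return top, before, after


# Toy data (entirely hypothetical): g1 is annotated to every set.
universe = [f"g{i}" for i in range(1, 21)]
gene_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g1", "g4", "g5"},
    "C": {"g1", "g6"},
}
hit_list = ["g1", "g2", "g7"]

top_gene, p_before, p_after = fragility(hit_list, gene_sets, universe, "A")
```

In this toy case the most-annotated hit gene is `g1`, and removing it weakens the enrichment of set `A` considerably. A result whose significance depends on a single heavily annotated gene is the kind of fragile, non-specific finding the abstract describes.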