[1] Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA [2] Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA [3] Department of Physics, University of Maryland, College Park, MD, USA.
[1] Department of Physics, University of Maryland, College Park, MD, USA [2] Institute for Physical Science and Technology, University of Maryland, College Park, MD, USA [3] Santa Fe Institute, Santa Fe, NM.
Sci Rep. 2014 Feb 26;4:4191. doi: 10.1038/srep04191.
Gene annotation databases (compendiums maintained by the scientific community that describe the biological functions performed by individual genes) are commonly used to evaluate the functional properties of experimentally derived gene sets. Overlap statistics, such as Fisher's exact test (FET), are often employed to assess these associations, but they do not account for non-uniformity in the number of genes annotated to individual functions or in the number of functions associated with individual genes. We find that FET is strongly biased toward overestimating overlap significance if a gene set has an unusually high number of annotations. To correct for these biases, we develop Annotation Enrichment Analysis (AEA), which properly accounts for the non-uniformity of annotations. We show that AEA is able to identify biologically meaningful functional enrichments that are obscured by numerous false-positive enrichment scores in FET, and we therefore suggest that it be used to more accurately assess the biological properties of gene sets.
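As a rough illustration of the overlap statistic discussed above, the following sketch computes a one-sided Fisher's exact (hypergeometric) p-value for the overlap between a gene set and the genes annotated to a single function. It shows only the FET baseline, not AEA, whose procedure is not specified in this abstract. The function name `overlap_pvalue` and its parameter names are hypothetical, chosen for this example.

```python
from math import comb


def overlap_pvalue(n_universe, n_annotated, n_set, n_overlap):
    """One-sided hypergeometric tail P(X >= n_overlap).

    Hypothetical helper for illustration only.
    n_universe  : total number of genes considered
    n_annotated : genes annotated to the function of interest
    n_set       : size of the experimentally derived gene set
    n_overlap   : genes in the set that carry the annotation
    """
    max_overlap = min(n_annotated, n_set)
    # Number of ways to draw a gene set of this size from the universe.
    total = comb(n_universe, n_set)
    # Count the draws whose overlap is at least as large as observed.
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_set - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers, not taken from the paper.
p = overlap_pvalue(n_universe=20, n_annotated=5, n_set=4, n_overlap=2)
print(f"one-sided overlap p-value: {p:.4f}")
```

This calculation treats every gene as equally likely to be drawn, regardless of how many annotations it carries. That is the uniformity assumption the abstract identifies as the source of bias for heavily annotated gene sets.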