National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics. 2010 Nov 1;26(21):2752-9. doi: 10.1093/bioinformatics/btq511. Epub 2010 Sep 8.
Term-enrichment analysis facilitates biological interpretation by assigning annotations, in the form of terms from controlled vocabularies, to experimentally or computationally obtained data. This process usually involves computing a statistical significance for each vocabulary term and using the most significant terms to describe a given set of biological entities, which are often associated with weights. Many existing enrichment methods require selecting an arbitrary number of the most significant entities and/or do not account for entity weights. Others either require extensive simulations to obtain statistics or assume normally distributed weights. In addition, most methods have difficulty assigning correct statistical significance to terms with few entities.
Implementing the well-known Lugannani-Rice formula, we have developed a novel approach, called SaddleSum, that is free from all the aforementioned constraints, and we evaluated it against several existing methods. With entity weights properly taken into account, SaddleSum is internally consistent and stable with respect to the number of most significant entities selected. Because it makes few assumptions about the input data, the proposed method is universal and can thus be applied to areas beyond the analysis of microarrays. Employing an asymptotic approximation, SaddleSum provides a term-size-dependent score distribution function that gives rise to accurate statistical significance even for terms with few entities. As a consequence, SaddleSum enables researchers to place confidence in its significance assignments to small terms, which are often the most biologically specific.
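To make the role of the Lugannani-Rice formula concrete, the following is a minimal sketch of a saddlepoint tail approximation, not the SaddleSum source code. It assumes (this is an assumption of the sketch, not a statement from the abstract) that the null score of a term with `m` entities is the sum of `m` independent draws from the empirical distribution of all entity weights. The function names `_cgf` and `saddlepoint_tail` are hypothetical. Writing $K$ for the cumulant generating function of a single weight and $s$ for the observed term score, the saddlepoint $\hat t$ solves $m K'(\hat t) = s$, and the tail is approximated by $1 - \Phi(w) + \phi(w)\,(1/u - 1/w)$ with $w = \sqrt{2(\hat t s - m K(\hat t))}$ and $u = \hat t \sqrt{m K''(\hat t)}$.

```python
import math

def _cgf(weights, t):
    """Return K(t), K'(t), K''(t) for the empirical distribution of weights."""
    shift = max(t * w for w in weights)           # avoid overflow in exp
    e = [math.exp(t * w - shift) for w in weights]
    z = sum(e)
    k = shift + math.log(z / len(weights))        # cumulant generating function
    k1 = sum(w * ei for w, ei in zip(weights, e)) / z            # tilted mean
    k2 = sum(w * w * ei for w, ei in zip(weights, e)) / z - k1 * k1  # tilted variance
    return k, k1, k2

def saddlepoint_tail(weights, m, score):
    """Approximate P(S >= score), where S is the sum of m i.i.d. draws
    from `weights` (assumed null model), via the Lugannani-Rice formula."""
    target = score / m
    mean = sum(weights) / len(weights)
    if not (mean < target < max(weights)):
        # The sketch only covers the upper tail with an interior saddlepoint.
        raise ValueError("score/m must lie strictly between mean and max weight")

    # K'(t) is increasing in t, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while _cgf(weights, hi)[1] < target:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if _cgf(weights, mid)[1] < target:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)

    k, _, k2 = _cgf(weights, t_hat)
    w = math.sqrt(2.0 * m * (t_hat * target - k))
    u = t_hat * math.sqrt(m * k2)
    normal_tail = 0.5 * math.erfc(w / math.sqrt(2.0))            # 1 - Phi(w)
    normal_pdf = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    return normal_tail + normal_pdf * (1.0 / u - 1.0 / w)

if __name__ == "__main__":
    # Toy weights on an even grid in [0, 1); a term with 12 entities scoring 8.
    toy_weights = [i / 1000 for i in range(1000)]
    print(saddlepoint_tail(toy_weights, m=12, score=8.0))
```

Because the approximation depends on `m` explicitly, each term size gets its own score distribution, which is the property the abstract credits for accurate significance on small terms. The sketch deliberately rejects scores at or below the mean, where the formula is singular and a separate limiting expression would be needed.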
Our implementation, which uses Bonferroni correction to account for multiple hypothesis testing, is available at http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/mn/enrich/. Source code for the standalone version can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/SaddleSum/.
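For readers unfamiliar with it, Bonferroni correction can be sketched as follows: each term's P-value is multiplied by the number of terms tested and capped at 1. This is a generic illustration, not code from the SaddleSum distribution; the function name `bonferroni` and the example term identifiers are hypothetical.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P-values for a dict of term -> raw P-value."""
    n_tests = len(p_values)
    # Multiply by the number of hypotheses tested and cap at 1.
    return {term: min(1.0, p * n_tests) for term, p in p_values.items()}

if __name__ == "__main__":
    raw = {"term_A": 0.001, "term_B": 0.02, "term_C": 0.6}  # made-up values
    for term, p_adj in bonferroni(raw).items():
        print(term, p_adj)
```

The correction is conservative, so accurate raw P-values in the far tail matter: a small term only survives the multiplication if its unadjusted significance is estimated reliably.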