Robust and accurate data enrichment statistics via distribution function of sum of weights.

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

Bioinformatics. 2010 Nov 1;26(21):2752-9. doi: 10.1093/bioinformatics/btq511. Epub 2010 Sep 8.

DOI:10.1093/bioinformatics/btq511

PMID:20826881

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2958744/

Abstract

MOTIVATION

Term-enrichment analysis facilitates biological interpretation by assigning annotations, in the form of terms from controlled vocabularies, to experimentally or computationally obtained data. This process usually involves obtaining statistical significance for each vocabulary term and using the most significant terms to describe a given set of biological entities, which are often associated with weights. Many existing enrichment methods require selecting an arbitrary number of the most significant entities and/or do not account for entity weights. Others either mandate extensive simulations to obtain statistics or assume a normal weight distribution. In addition, most methods have difficulty assigning correct statistical significance to terms with few entities.

RESULTS

Implementing the well-known Lugannani-Rice formula, we have developed a novel approach, called SaddleSum, that is free from all the aforementioned constraints, and we have evaluated it against several existing methods. With entity weights properly taken into account, SaddleSum is internally consistent and stable with respect to the number of most significant entities selected. Because it makes few assumptions about the input data, the proposed method is universal and can thus be applied to areas beyond the analysis of microarrays. Employing an asymptotic approximation, SaddleSum provides a term-size-dependent score distribution function that gives rise to accurate statistical significance even for terms with few entities. As a consequence, SaddleSum enables researchers to place confidence in its significance assignments to small terms, which are often biologically the most specific.
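
The role of the Lugannani-Rice formula can be illustrated with a minimal sketch. It is not the SaddleSum source code. It assumes a null model in which the score of a term with m entities is the sum of m weights drawn independently, with replacement, from the empirical distribution of all entity weights; the function name `saddlepoint_tail` and its parameters are hypothetical. With K(t) the cumulant generating function of a single weight, the saddlepoint t solves m K'(t) = s, and the tail probability is approximated by 1 - Phi(w) + phi(w) (1/u - 1/w), where w = sqrt(2 (t s - m K(t))) and u = t sqrt(m K''(t)). The sketch covers only scores above the null mean and applies no lattice correction.

```python
import math


def saddlepoint_tail(weights, m, s, tol=1e-12):
    """Approximate P(S >= s), where S is the sum of m i.i.d. draws from
    the empirical distribution of `weights` (an assumed null model)."""
    n = len(weights)
    w_max = max(weights)
    mean = sum(weights) / n
    if not (m * mean < s < m * w_max):
        raise ValueError("s must lie strictly between m*mean and m*max")

    def cgf_derivs(t):
        # Return K(t), K'(t), K''(t) for one weight. The exponent is
        # shifted by w_max so exp() cannot overflow for t >= 0.
        e = [math.exp(t * (w - w_max)) for w in weights]
        z = sum(e)
        k0 = t * w_max + math.log(z / n)
        k1 = sum(w * x for w, x in zip(weights, e)) / z
        k2 = sum(w * w * x for w, x in zip(weights, e)) / z - k1 * k1
        return k0, k1, k2

    # Solve m*K'(t) = s by bisection. K' is increasing, and the root is
    # positive because s exceeds the null mean.
    lo, hi = 0.0, 1.0
    while m * cgf_derivs(hi)[1] < s:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if m * cgf_derivs(mid)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    k0, _, k2 = cgf_derivs(t)
    w_hat = math.sqrt(2.0 * (t * s - m * k0))
    u_hat = t * math.sqrt(m * k2)
    phi = math.exp(-0.5 * w_hat * w_hat) / math.sqrt(2.0 * math.pi)
    upper_normal = 0.5 * math.erfc(w_hat / math.sqrt(2.0))
    # Lugannani-Rice tail approximation.
    return upper_normal + phi * (1.0 / u_hat - 1.0 / w_hat)


if __name__ == "__main__":
    # Toy input: 1000 made-up weights, a term with 5 entities.
    toy_weights = [(i % 37) / 37.0 for i in range(1000)]
    print(saddlepoint_tail(toy_weights, m=5, s=4.0))
```

Because m enters the saddlepoint equation directly, the approximated distribution changes with term size, which is the property the abstract highlights for terms with few entities. Scores very close to the null mean make the two reciprocal terms nearly cancel, so a production implementation would need to treat that region separately.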

AVAILABILITY

Our implementation, which uses Bonferroni correction to account for multiple hypotheses testing, is available at http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/mn/enrich/. Source code for the standalone version can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/SaddleSum/.
