Vêncio Ricardo Z N, Shmulevich Ilya
Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103-8904, USA.
BMC Bioinformatics. 2007 Oct 12;8:383. doi: 10.1186/1471-2105-8-383.
As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. Despite efforts to create probabilistic annotations, especially in the Gene Ontology context, and to deal with uncertainty in high-throughput datasets, current enrichment methods largely ignore this probabilistic information because they are mainly based on variants of the Fisher Exact Test.
We developed ProbCD, open-source R-based software for probabilistic categorical data analysis that does not require a static contingency table. The contingency table for the enrichment problem is built from the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An online interface for non-programmers is available at: http://xerad.systemsbiology.net/ProbCD/.
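To make the idea of an expected contingency table concrete, the following is a minimal sketch in Python, not ProbCD's actual R code or interface. It assumes each gene carries two probabilities, one for belonging to the annotation category and one for belonging to the selected gene list, and that the two memberships are independent Bernoulli variables for each gene. The function name `expected_contingency_table` and both parameter names are hypothetical.

```python
from typing import List, Sequence


def expected_contingency_table(
    p_category: Sequence[float],
    p_selected: Sequence[float],
) -> List[List[float]]:
    """Expected 2x2 table under per-gene independent Bernoulli memberships.

    Assumption (not taken from the source): for gene g, membership in the
    category has probability p_category[g], membership in the selected list
    has probability p_selected[g], and the two are independent.

    Rows: in category / not in category.
    Columns: selected / not selected.
    """
    if len(p_category) != len(p_selected):
        raise ValueError("both probability vectors must have one entry per gene")

    table = [[0.0, 0.0], [0.0, 0.0]]
    for p, q in zip(p_category, p_selected):
        if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
            raise ValueError("probabilities must lie in [0, 1]")
        # Each gene contributes its joint membership probabilities,
        # so every gene adds exactly 1 to the table total.
        table[0][0] += p * q
        table[0][1] += p * (1.0 - q)
        table[1][0] += (1.0 - p) * q
        table[1][1] += (1.0 - p) * (1.0 - q)
    return table


if __name__ == "__main__":
    # Three hypothetical genes: two with certain annotation, one uncertain.
    print(expected_contingency_table([1.0, 0.0, 0.5], [1.0, 1.0, 0.5]))
```

When every probability is exactly 0 or 1, this sketch reduces to the ordinary count-based table used by conventional enrichment methods. With intermediate probabilities the cells are generally non-integer, so a test that requires integer counts cannot be applied to them directly.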
We present an analysis framework and software tools to address uncertainty in categorical data analysis. For enrichment analysis in particular, ProbCD can accommodate (i) the stochastic nature of high-throughput experimental techniques and (ii) probabilistic gene annotation.