Maglietta R, Piepoli A, Catalano D, Licciulli F, Carella M, Liuni S, Pesole G, Perri F, Ancona N
Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, Italy.
Bioinformatics. 2007 Aug 15;23(16):2063-72. doi: 10.1093/bioinformatics/btm289. Epub 2007 May 31.
A major challenge in current biomedical research is the identification of cellular processes deregulated in a given pathology through the analysis of gene expression profiles. To this end, predefined lists of genes encoding specific functions are compared with a list of genes ranked by differential expression, as measured by suitable univariate statistics.
We propose a statistically well-founded method for measuring the relevance of predefined lists of genes and for assessing their statistical significance, starting from their raw expression levels as recorded on the microarray. We use prediction accuracy as the measure of relevance of a list. The rationale is that a functional category, coded through a list of genes, is perturbed in a given pathology if the occurrence of the disease in new subjects can be correctly predicted from the expression levels of the genes in that list alone. The accuracy is estimated with a multiple random validation strategy, and its statistical significance is assessed against two null hypotheses by using two independent permutation tests. The utility of the proposed methodology is illustrated by analyzing the relevance of Gene Ontology terms belonging to the biological process category in colon and prostate cancer, using three different microarray data sets, and by comparing it with current approaches.
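The procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the classifier (a nearest-centroid rule), the split fraction, the number of splits and permutations, and in particular the two null hypotheses, which the abstract does not spell out. Here they are taken to be (a) no association between expression and class labels, tested by shuffling the labels, and (b) the list being no more predictive than a random gene list of the same size. All function and parameter names are hypothetical.

```python
import random

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Fraction of test samples assigned to the class with the closest centroid."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return sum(predict(x) == y for x, y in zip(X_test, y_test)) / len(y_test)

def random_validation_accuracy(X, y, genes, n_splits, test_frac, rng):
    """Mean held-out accuracy over repeated random splits, using only `genes`."""
    Xg = [[row[g] for g in genes] for row in X]  # restrict to the gene list
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    accs = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        accs.append(nearest_centroid_accuracy(
            [Xg[i] for i in train], [y[i] for i in train],
            [Xg[i] for i in test], [y[i] for i in test]))
    return sum(accs) / len(accs)

def permutation_p_values(X, y, genes, n_perm=200, n_splits=20,
                         test_frac=0.25, seed=0):
    """Return (accuracy, p_label, p_genes) for one gene list.

    p_label: assumed null (a), class labels shuffled across subjects.
    p_genes: assumed null (b), random gene lists of the same size.
    """
    rng = random.Random(seed)
    observed = random_validation_accuracy(X, y, genes, n_splits, test_frac, rng)
    n_genes_total = len(X[0])
    hits_label = hits_genes = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        if random_validation_accuracy(X, y_perm, genes,
                                      n_splits, test_frac, rng) >= observed:
            hits_label += 1
        random_genes = rng.sample(range(n_genes_total), len(genes))
        if random_validation_accuracy(X, y, random_genes,
                                      n_splits, test_frac, rng) >= observed:
            hits_genes += 1
    # Add-one correction keeps the estimated p-values strictly positive.
    return (observed,
            (hits_label + 1) / (n_perm + 1),
            (hits_genes + 1) / (n_perm + 1))
```

In this sketch, `X` is a list of per-subject expression vectors, `y` the corresponding class labels, and `genes` the column indices of the genes in the list under test. A list would be considered relevant when its accuracy is high and both p-values are small.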
Source code for the algorithms is available from the authors upon request.
The colon cancer data set and a complete description of the experimental results are available at: ftp://bioftp:76bioftpxxx@marx.ba.issia.cnr.it/supp-info.htm.