Mukherjee Sach, Roberts Stephen J, van der Laan Mark J
Department of Engineering Science, University of Oxford, UK.
Bioinformatics. 2005 Sep 1;21 Suppl 2:ii108-14. doi: 10.1093/bioinformatics/bti1119.
An important task in microarray data analysis is the selection of genes that are differentially expressed between different tissue samples, such as healthy and diseased. However, microarray data contain an enormous number of dimensions (genes) and very few samples (arrays), a mismatch which poses fundamental statistical problems for the selection process that have defied easy resolution.
In this paper, we present a novel approach to the selection of differentially expressed genes in which test statistics are learned from data, using a simple notion of reproducibility in selection results as the learning criterion. Reproducibility, as we define it, can be computed without any knowledge of the ground truth, but takes advantage of certain properties of microarray data to provide an asymptotically valid guide to expected loss under the true data-generating distribution. We are therefore able to minimize expected loss indirectly, and obtain results substantially more robust than those of conventional methods. We apply our method to simulated and oligonucleotide array data.
Availability: by request to the corresponding author.
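The abstract does not spell out how reproducibility is computed, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes (hypothetically) a one-parameter family of test statistics, here a two-sample t-statistic with an offset `s0` added to the standard error, and scores each candidate by how much the top-`k` gene lists from two disjoint halves of the arrays overlap. The function names, the overlap measure, and the candidate grid are all illustrative choices.

```python
import numpy as np

def regularized_t(x, labels, s0):
    """Per-gene two-sample t-like statistic; s0 is an assumed offset on the standard error."""
    a = x[:, labels == 0]
    b = x[:, labels == 1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return diff / (se + s0)

def top_k(x, labels, s0, k):
    """Indices of the k genes with the largest absolute statistic."""
    scores = np.abs(regularized_t(x, labels, s0))
    return set(np.argsort(scores)[-k:].tolist())

def reproducibility(x, labels, s0, k, n_splits, seed):
    """Mean fractional overlap of top-k lists selected on disjoint halves of the arrays."""
    rng = np.random.default_rng(seed)  # same seed gives the same splits for every candidate
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    overlaps = []
    for _ in range(n_splits):
        p0, p1 = rng.permutation(idx0), rng.permutation(idx1)
        h0, h1 = len(p0) // 2, len(p1) // 2
        half_a = np.concatenate([p0[:h0], p1[:h1]])
        half_b = np.concatenate([p0[h0:], p1[h1:]])
        sel_a = top_k(x[:, half_a], labels[half_a], s0, k)
        sel_b = top_k(x[:, half_b], labels[half_b], s0, k)
        overlaps.append(len(sel_a & sel_b) / k)
    return float(np.mean(overlaps))

def learn_statistic(x, labels, candidates, k=20, n_splits=20, seed=0):
    """Pick the candidate offset whose selections are most reproducible; no ground truth is used."""
    scores = {s0: reproducibility(x, labels, s0, k, n_splits, seed) for s0 in candidates}
    best = max(scores, key=scores.get)
    return best, scores

if __name__ == "__main__":
    # Simulated data (genes x arrays): 200 genes, 8 + 8 arrays, first 20 genes shifted.
    rng = np.random.default_rng(1)
    x = rng.normal(size=(200, 16))
    labels = np.array([0] * 8 + [1] * 8)
    x[:20, labels == 1] += 2.0
    best, scores = learn_statistic(x, labels, candidates=(0.0, 0.1, 0.5, 1.0))
    print("chosen offset:", best)
    print("reproducibility by offset:", scores)
```

Each half must contain at least two arrays per class for the variance estimate to be defined, so this sketch needs at least four arrays per class overall.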