Troyanskaya Olga G, Garber Mitchell E, Brown Patrick O, Botstein David, Altman Russ B
Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
Bioinformatics. 2002 Nov;18(11):1454-61. doi: 10.1093/bioinformatics/18.11.1454.
Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) the nonparametric t-test, (2) the Wilcoxon (or Mann-Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene (the 'ideal discriminator method'). We systematically assess the performance of each method on simulated and biological data under varying noise levels and p-value cutoffs.
All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise levels similar to that of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of the groups in question. Thus, the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis.
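As a rough illustration of the three approaches compared above, the following minimal sketch scores a single gene measured in two groups. It is not the authors' implementation. It assumes that the "nonparametric t-test" can be approximated by a Welch t statistic whose p-value comes from permuting group labels, and that the same permutation scheme is applied to the rank sum and to the ideal discriminator correlation. All function names, the permutation count, and the expression values are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def welch_t(a, b):
    """Welch t statistic (unequal variances) for two groups."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    if se == 0:  # degenerate case: no within-group variation
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def rank_sum(a, b):
    """Sum of the pooled ranks of group a (ties get average ranks)."""
    pooled = sorted(list(a) + list(b))
    ranks, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return sum(ranks[x] for x in a)

def rank_sum_centered(a, b):
    """Rank sum of a minus its expected value under no difference."""
    return rank_sum(a, b) - len(a) * (len(a) + len(b) + 1) / 2

def ideal_discriminator_corr(a, b):
    """Pearson correlation between the gene's expression profile and an
    idealized profile that is 1 in group a and 0 in group b."""
    x = list(a) + list(b)
    y = [1.0] * len(a) + [0.0] * len(b)
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_pvalue(stat, a, b, n_perm=2000, seed=0):
    """Two-sided p-value for stat(a, b), estimated by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(stat(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stat = abs(stat(pooled[:len(a)], pooled[len(a):]))
        if perm_stat >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical log-expression values for one gene in two groups.
tumor = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
normal = [2.0, 2.4, 1.9, 2.2, 2.6, 2.1]

for name, stat in [("permutation t-test", welch_t),
                   ("rank sum test", rank_sum_centered),
                   ("ideal discriminator", ideal_discriminator_corr)]:
    print(name, permutation_pvalue(stat, tumor, normal))
```

In practice this scoring would be repeated for every gene on the array, and genes whose p-value falls below the chosen cutoff would be reported as discriminators. The cutoff governs the trade-off between a conservative list and a more inclusive one.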