Pepe Margaret Sullivan, Longton Gary, Anderson Garnet L, Schummer Michel
Department of Biostatistics, University of Washington, Seattle, Washington 98195-7232, USA.
Biometrics. 2003 Mar;59(1):133-42. doi: 10.1111/1541-0420.00016.
High-throughput technologies, such as gene expression arrays and protein mass spectrometry, allow one to simultaneously evaluate thousands of potential biomarkers that could distinguish different tissue types. Of particular interest here is distinguishing between cancerous and normal organ tissues. We consider statistical methods to rank genes (or proteins) with regard to differential expression between tissues. Various statistical measures are considered, and we argue that two measures related to the receiver operating characteristic (ROC) curve are particularly suitable for this purpose. We also propose that sampling variability in the gene rankings be quantified, and suggest using the "selection probability function," the probability distribution of rankings for each gene. This function is estimated via the bootstrap. A real dataset, derived from gene expression arrays of 23 normal and 30 ovarian cancer tissues, is analyzed. Simulation studies are also used to assess the relative performance of different statistical gene ranking measures and our quantification of sampling variability. Our approach leads naturally to a procedure for sample-size calculations, appropriate for exploratory studies that seek to identify differentially expressed genes.
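The ranking and bootstrap steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two ROC-related measures, so this example assumes the empirical area under the ROC curve (AUC) as the ranking statistic, and it summarizes the selection probability function at a single cutoff k, that is, the estimated probability that a gene ranks in the top k. The function names, the cutoff, the number of bootstrap replicates, and the choice to resample tissues within each group are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def empirical_auc(cases, controls):
    """Per-gene empirical AUC: P(case > control) + 0.5 * P(tie).

    cases:    array of shape (n_cases, n_genes)
    controls: array of shape (n_controls, n_genes)
    Assumption: large values indicate overexpression in cancer tissue.
    """
    # Compare every case with every control, separately for each gene.
    diff = cases[:, None, :] - controls[None, :, :]
    return ((diff > 0) + 0.5 * (diff == 0)).mean(axis=(0, 1))


def rank_genes(scores):
    """Return ranks where rank 1 is the gene with the largest score."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks


def selection_probability(cases, controls, k, n_boot=200, seed=0):
    """Bootstrap estimate of P(gene ranks in the top k), one value per gene."""
    rng = np.random.default_rng(seed)
    n_cases, n_controls = len(cases), len(controls)
    hits = np.zeros(cases.shape[1])
    for _ in range(n_boot):
        # Resample tissues with replacement within each group (assumed scheme).
        boot_cases = cases[rng.integers(0, n_cases, n_cases)]
        boot_controls = controls[rng.integers(0, n_controls, n_controls)]
        ranks = rank_genes(empirical_auc(boot_cases, boot_controls))
        hits += ranks <= k
    return hits / n_boot
```

With expression matrices of 30 cancer and 23 normal tissues (rows) by genes (columns), `selection_probability(cases, controls, k=5)` returns one estimated probability per gene. Because exactly k genes fall in the top k in each replicate, these probabilities sum to k. Genes that are strongly under-expressed in cancer would score near zero under this orientation, so a two-sided analysis would need a different statistic or a second pass with the groups swapped.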