Jung Sin-Ho, Jang Woncheol
Department of Biostatistics and Bioinformatics, Duke University, NC 27710, USA.
Bioinformatics. 2006 Jul 15;22(14):1730-6. doi: 10.1093/bioinformatics/btl161. Epub 2006 Apr 27.
We aim to evaluate the performance of two FDR-based multiple testing procedures, by Benjamini and Hochberg (1995, J. R. Stat. Soc. Ser. B, 57, 289-300) and Storey (2002, J. R. Stat. Soc. Ser. B, 64, 479-498), in analyzing real microarray data. These procedures commonly require independence or weak dependence of the test statistics. However, expression levels of different genes on the same array are usually correlated, owing to co-expressed genes and to various sources of error from experiment-specific and subject-specific conditions that are not adjusted for in data analysis. Because of the high dimensionality of microarray data, it is usually impossible to check whether the weak dependence condition is met for a given dataset. We propose to generate a large number of test statistics from a simulation model that has asymptotically (in terms of the number of arrays) the same correlation structure as the test statistics that will be calculated from the given data, and to investigate how accurately the FDR-based testing procedures control the FDR on the simulated data. Our approach is to check the performance of these procedures directly for a given dataset, rather than to check the weak dependence condition. We illustrate the proposed method with two real microarray datasets, one where the clinical endpoint is disease group and another where it is survival.
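The idea of checking FDR control empirically on simulated, correlated test statistics can be illustrated with a minimal sketch. The code below is not the simulation model of the paper: it assumes a toy equicorrelated normal model (a single shared factor with correlation `rho`) in place of a correlation structure estimated from real arrays, and it applies only the Benjamini-Hochberg step-up rule. The function names `bh_reject` and `simulate_fdr`, and all parameter defaults, are hypothetical choices made for illustration.

```python
import math
import numpy as np

# Vectorized complementary error function, used for two-sided normal p-values.
_erfc = np.vectorize(math.erfc)


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank whose p-value passes
        reject[order[: k + 1]] = True
    return reject


def simulate_fdr(n_genes=200, n_alt=20, rho=0.5, effect=3.0,
                 q=0.05, n_sim=300, seed=0):
    """Estimate the FDR of BH under an ASSUMED equicorrelated normal model."""
    rng = np.random.default_rng(seed)
    is_null = np.ones(n_genes, dtype=bool)
    is_null[:n_alt] = False  # the first n_alt genes are truly differential
    mean = np.where(is_null, 0.0, effect)
    fdp = np.empty(n_sim)
    for s in range(n_sim):
        shared = rng.standard_normal()  # factor common to all genes
        noise = rng.standard_normal(n_genes)
        z = mean + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise
        pvals = _erfc(np.abs(z) / math.sqrt(2.0))  # two-sided p-values
        rejected = bh_reject(pvals, q)
        n_rejected = rejected.sum()
        # False discovery proportion, defined as 0 when nothing is rejected.
        fdp[s] = (rejected & is_null).sum() / max(n_rejected, 1)
    return fdp.mean()


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(f"rho={rho}: estimated FDR = {simulate_fdr(rho=rho):.3f}")
```

Comparing the estimated FDR with the nominal level `q` across values of `rho` mirrors, in a simplified setting, the kind of check the abstract describes: the procedure is judged by its observed behavior on simulated statistics rather than by verifying a dependence condition.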