Storey John D, Tibshirani Robert
Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
Proc Natl Acad Sci U S A. 2003 Aug 5;100(16):9440-5. doi: 10.1073/pnas.1530509100. Epub 2003 Jul 25.
With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. Often, thousands of features in a genomewide data set are tested against some null hypothesis, and a number of those features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well-known p value, except that it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than has been used in genome scans for linkage.
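To make the idea of a q value concrete, the following is a minimal sketch in Python of one common way to turn a list of p values into false-discovery-rate-based q values. The abstract does not spell out the estimation procedure, so everything here is an assumption for illustration: the function name `q_values` is hypothetical, and `pi0` (the assumed proportion of truly null features) is simply supplied by the caller rather than estimated from the data. With the default `pi0=1.0` the result coincides with Benjamini-Hochberg adjusted p values, which is a conservative stand-in and not necessarily the authors' full method.

```python
def q_values(p_values, pi0=1.0):
    """Illustrative q-value computation (hypothetical helper).

    p_values: list of p values, one per tested feature.
    pi0: assumed proportion of truly null features (assumption; 1.0 is
         the most conservative choice and is not estimated from data here).
    Returns q values in the same order as the input.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the features, sorted from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])

    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down to the smallest so each q value
    # is the minimum estimated FDR over all thresholds at or above its p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # Estimated FDR if every feature with p <= p_values[i] is called significant.
        fdr_estimate = pi0 * m * p_values[i] / rank
        running_min = min(running_min, fdr_estimate)
        q[i] = running_min
    return q


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.5]
    for p, q in zip(example, q_values(example)):
        print(f"p = {p:.3f}  q = {q:.4f}")
```

Under these assumptions, calling all features with q value at or below 0.05 significant would mean that roughly 5% of those called features are expected to be false positives, which is the interpretation the abstract contrasts with the false positive rate controlled by a p value cutoff.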