Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205-2179, USA.
BMC Bioinformatics. 2012 Jun 27;13:150. doi: 10.1186/1471-2105-13-150.
Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets.
Here we present a new statistical method for validating lists of significant results by confirming only a small random sample. We apply the method to show that the usual practice of confirming only the most statistically significant results does not statistically validate result lists. We analyze an extensively validated RNA-sequencing experiment to show that confirming a random subset can statistically validate entire lists of significant results. Finally, we analyze multiple publicly available microarray experiments to show that statistically validating random samples can both (i) provide evidence to confirm long gene lists and (ii) save thousands of dollars and hundreds of hours of labor compared with manual validation of each significant result.
For high-throughput -omics studies, statistical validation is a cost-effective and statistically valid approach to confirming lists of significant results.
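The idea of validating a whole list from a small random sample can be illustrated with a minimal sketch. This is not necessarily the exact procedure of the paper: it assumes a simple binomial model for the number of confirmed results and a uniform Beta(1, 1) prior on the true-positive proportion of the list, and it ignores the finite size of the list. The function names, the gene identifiers, and the 19-of-20 outcome are hypothetical and chosen only for illustration.

```python
import random
from math import comb


def sample_validation_targets(significant_ids, n_validate, seed=0):
    """Draw a simple random sample (without replacement) of results to confirm.

    Hypothetical helper: the seed only makes the draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(list(significant_ids), n_validate)


def prob_true_positive_rate_at_least(n_confirmed, n_tested, threshold):
    """Posterior probability that the list's true-positive proportion p
    is at least `threshold`.

    Assumptions (not taken from the source): binomial sampling and a
    uniform prior, so the posterior is
    Beta(n_confirmed + 1, n_tested - n_confirmed + 1).
    For integer parameters the Beta CDF equals a binomial tail sum:
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    if not 0 <= n_confirmed <= n_tested:
        raise ValueError("need 0 <= n_confirmed <= n_tested")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    m = n_tested + 1  # a + b - 1 for the posterior Beta parameters
    cdf = sum(
        comb(m, j) * threshold**j * (1.0 - threshold) ** (m - j)
        for j in range(n_confirmed + 1, m + 1)
    )
    return 1.0 - cdf


if __name__ == "__main__":
    # Hypothetical list of 500 significant genes; confirm a random 20 of them.
    genes = [f"gene_{i}" for i in range(500)]
    targets = sample_validation_targets(genes, 20)
    # Suppose 19 of the 20 sampled genes are confirmed independently.
    # Is the claimed false discovery rate of 10% plausible for the full list?
    print(prob_true_positive_rate_at_least(19, 20, threshold=0.90))
```

Because the targets are drawn at random, the confirmation rate in the sample estimates the rate for the entire list. The same calculation applied to only the top-ranked results would describe those results alone, which is why confirming only the most significant hits does not validate the full list.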