Jensen Shane T, Soi Sameer, Wang Li-San
Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA.
BMC Bioinformatics. 2009 Jun 28;10:198. doi: 10.1186/1471-2105-10-198.
Large-scale statistical analyses have become hallmarks of biological research in the post-genomic era, owing to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption on the null distribution, such as normality, may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
We present a new approach that assigns resamples (such as bootstrap samples or permutations) more efficiently within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running time overhead. In two experimental studies, a breast cancer microarray dataset and a genome-wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate than the traditional uniform resample allocation.
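The idea of differential allocation can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details the abstract does not give; it is a simple heuristic under stated assumptions. Each test's resampling p-value gets a Beta posterior under a uniform prior, and each new batch of resamples goes to the test whose posterior mean lies closest to a significance threshold in posterior-standard-deviation units. The function name `adaptive_resample_pvalues`, the parameters `pilot`, `batch`, and `alpha`, and the simulated tests are all hypothetical.

```python
import math
import random


def adaptive_resample_pvalues(draws, total_budget, pilot=20, batch=10, alpha=0.05):
    """Estimate p-values for many tests under a fixed total resample budget.

    draws: list of zero-argument callables; draws[i]() performs one resample
           for test i and returns True if the resampled statistic is at least
           as extreme as the observed one.
    Returns (p-value estimates, resamples spent per test).
    """
    m = len(draws)
    if total_budget < m * pilot:
        raise ValueError("budget too small for the pilot round")
    hits = [0] * m  # resamples at least as extreme as the observed statistic
    used = [0] * m  # resamples spent on each test

    def spend(i, k):
        for _ in range(k):
            hits[i] += bool(draws[i]())
        used[i] += k

    # Pilot round: uniform allocation so every test has a rough estimate.
    for i in range(m):
        spend(i, pilot)
    remaining = total_budget - m * pilot

    def separation(i):
        # Beta(1 + hits, 1 + misses) posterior for the p-value (uniform prior).
        a, b = 1 + hits[i], 1 + used[i] - hits[i]
        mean = a / (a + b)
        sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        # A small value means the test is still undecided relative to alpha.
        return abs(mean - alpha) / sd

    # Assumed heuristic: give each batch to the least-decided test.
    while remaining > 0:
        i = min(range(m), key=separation)
        k = min(batch, remaining)
        spend(i, k)
        remaining -= k

    # Add-one estimator keeps every p-value estimate strictly positive.
    pvals = [(hits[i] + 1) / (used[i] + 1) for i in range(m)]
    return pvals, used


# Simulated example: eight clearly null tests and two borderline ones.
rng = random.Random(0)
true_p = [0.5] * 8 + [0.04, 0.06]
draws = [lambda p=p: rng.random() < p for p in true_p]
pvals, used = adaptive_resample_pvalues(draws, total_budget=5000)
print(used)
```

In this simulation the two borderline tests absorb most of the budget, whereas a uniform scheme would spend 500 resamples on every test regardless of how clearly it is resolved. In a real analysis, each callable in `draws` would permute or bootstrap the data and recompute the test statistic.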
Our experiments demonstrate that a more sophisticated allocation strategy can improve inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, the gain in efficiency is greater when the number of tests is large. R code for our algorithm and the shortcut method is available at http://people.pcbi.upenn.edu/~lswang/pub/bmc2009/.