Frost H Robert, Li Zhigang, Asselbergs Folkert W, Moore Jason H
IEEE/ACM Trans Comput Biol Bioinform. 2015 Sep-Oct;12(5):1076-86. doi: 10.1109/TCBB.2015.2415815.
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. As its filter statistic, the SGSF method uses the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters out gene sets unrelated to the experimental outcome, resulting in significantly increased gene set testing power.
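The power argument in the abstract (fewer tested hypotheses, same type I error control) can be illustrated with a minimal sketch of independent filtering followed by a Benjamini-Hochberg correction. This is not the authors' SGSF implementation: the function names `benjamini_hochberg` and `filter_then_test` are hypothetical, the filter p-values are made-up toy numbers standing in for the PC-association p-values the abstract describes, and Benjamini-Hochberg is assumed here as the multiple testing correction. The sketch is only valid under the condition the abstract states, namely that the filter statistic is independent of the test statistic under the null hypothesis.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def filter_then_test(test_p, filter_p, filter_alpha=0.05, alpha=0.05):
    """Keep gene sets with filter p-value <= filter_alpha, then apply BH to the survivors.

    Assumption: filter_p is independent of test_p under the null hypothesis.
    """
    kept = [i for i, fp in enumerate(filter_p) if fp <= filter_alpha]
    rejected_local = benjamini_hochberg([test_p[i] for i in kept], alpha)
    # Map positions within the filtered list back to the original gene set indices.
    return [kept[j] for j in rejected_local]


# Toy inputs (illustrative only): ten gene sets, three of them truly enriched.
test_p = [0.004, 0.012, 0.02, 0.30, 0.45, 0.52, 0.61, 0.74, 0.83, 0.95]
filter_p = [0.001, 0.01, 0.02, 0.03, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

unfiltered = benjamini_hochberg(test_p)
filtered = filter_then_test(test_p, filter_p)
print("rejected without filtering:", unfiltered)
print("rejected with filtering:", filtered)
```

In this toy example, filtering shrinks the collection from ten gene sets to four, which loosens the per-rank Benjamini-Hochberg thresholds enough for more of the small test p-values to pass.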