Phipson Belinda, Lee Stanley, Majewski Ian J, Alexander Warren S, Smyth Gordon K
Murdoch Childrens Research Institute.
The Walter and Eliza Hall Institute of Medical Research; The University of Melbourne.
Ann Appl Stat. 2016 Jun;10(2):946-963. doi: 10.1214/16-AOAS920. Epub 2016 Jul 22.
One of the most common analysis tasks in genomic research is to identify genes that are differentially expressed (DE) between experimental conditions. Empirical Bayes (EB) statistical tests using moderated genewise variances have been very effective for this purpose, especially when the number of biological replicate samples is small. The EB procedures can, however, be heavily influenced by a small number of genes with very large or very small variances. This article improves the differential expression tests by robustifying the hyperparameter estimation procedure. The robust procedure decreases the informativeness of the prior distribution for outlier genes while increasing its informativeness for other genes. This has the double benefit of reducing the chance that hypervariable genes will be spuriously identified as DE while increasing statistical power for the main body of genes. The robust EB algorithm is fast and numerically stable. The procedure allows exact small-sample null distributions for the test statistics and reduces exactly to the original EB procedure when no outlier genes are present. Simulations show that the robustified tests perform similarly to the original tests in the absence of outlier genes but have greater power and robustness when outliers are present. The article includes case studies in which the robust method correctly identifies and downweights genes associated with hidden covariates and detects more genes likely to be scientifically relevant to the experimental conditions. The new procedure is implemented in the limma software package, freely available from the Bioconductor repository.
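The effect of the prior's informativeness on a single gene can be illustrated with a minimal sketch. It assumes the standard EB moderated variance, a weighted average of the prior variance and the genewise sample variance with the degrees of freedom as weights; this formula is not stated in the abstract. The function name and all numeric values are hypothetical and chosen only for illustration. In limma itself the robust option is requested with `eBayes(fit, robust=TRUE)`; the Python below is not that implementation.

```python
def moderated_variance(s2, d, s0_sq, d0):
    """Moderated (posterior) variance for one gene.

    s2    : genewise sample variance
    d     : residual degrees of freedom of s2
    s0_sq : prior variance (hyperparameter)
    d0    : prior degrees of freedom (hyperparameter); larger d0 means
            a more informative prior and stronger shrinkage toward s0_sq
    """
    return (d0 * s0_sq + d * s2) / (d0 + d)


# Hypothetical values, for illustration only.
s0_sq = 1.0   # assumed prior variance
d = 4         # assumed residual degrees of freedom per gene

# A hypervariable gene under a single shared prior df: its large
# variance is pulled strongly toward the prior.
outlier_shared = moderated_variance(9.0, d, s0_sq, d0=10.0)

# The same gene with a reduced prior df, as a robust fit might assign
# to an outlier: the prior is less informative, so shrinkage is weaker.
outlier_robust = moderated_variance(9.0, d, s0_sq, d0=1.0)

# A typical gene with an increased prior df: the prior is more
# informative, so the variance estimate borrows more strength.
typical_shared = moderated_variance(1.2, d, s0_sq, d0=10.0)
typical_robust = moderated_variance(1.2, d, s0_sq, d0=20.0)

print(f"outlier gene, shared prior df : {outlier_shared:.3f}")
print(f"outlier gene, reduced prior df: {outlier_robust:.3f}")
print(f"typical gene, shared prior df : {typical_shared:.3f}")
print(f"typical gene, larger prior df : {typical_robust:.3f}")
```

In this sketch the outlier gene keeps a moderated variance closer to its own large sample variance when its prior degrees of freedom are reduced, which makes a spurious DE call less likely, while the typical gene is shrunk more tightly toward the prior variance.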