Lu Mengyin, Stephens Matthew
Department of Statistics, University of Chicago, Chicago, 60637, USA.
Department of Human Genetics, University of Chicago, Chicago, 60637, USA.
Bioinformatics. 2016 Nov 15;32(22):3428-3434. doi: 10.1093/bioinformatics/btw483. Epub 2016 Jul 19.
Genomic studies often involve estimation of variances of thousands of genes (or other genomic units) from just a few measurements on each. For example, variance estimation is an important step in gene expression analyses aimed at identifying differentially expressed genes. A common approach to this problem is to use an Empirical Bayes (EB) method that assumes the variances among genes follow an inverse-gamma distribution. This distributional assumption is relatively inflexible; for example, it may not capture 'outlying' genes whose variances are considerably bigger than usual. Here we describe a more flexible EB method, capable of capturing a much wider range of distributions. Indeed, the main assumption is that the distribution of the variances is unimodal (or, as an alternative, that the distribution of the precisions is unimodal). We argue that the unimodal assumption provides an attractive compromise between flexibility, computational tractability and statistical efficiency.
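The abstract does not give the model in detail, so the following is only a minimal sketch of one way a unimodal EB prior on variances could be built, not the vashr implementation. It assumes each sample variance satisfies s² | σ² ~ σ² χ²_d / d, and it takes the prior to be a mixture of inverse-gamma components that all share one mode (a mixture of this kind is itself unimodal). The function names, the shape grid, the fixed mode and the EM update for the mixture weights are all illustrative assumptions. With a single component it reduces to the usual conjugate inverse-gamma shrinkage.

```python
import math

def log_marginal(s2, d, shape, scale):
    """Log density of a sample variance s2 with d degrees of freedom when
    sigma^2 ~ InvGamma(shape, scale) and s2 | sigma^2 ~ sigma^2 * chi2_d / d."""
    h = d / 2.0
    return (h * math.log(h) + (h - 1.0) * math.log(s2) - math.lgamma(h)
            + shape * math.log(scale) - math.lgamma(shape)
            + math.lgamma(shape + h) - (shape + h) * math.log(scale + h * s2))

def scales_for_mode(shapes, mode):
    # InvGamma(shape, scale) has its mode at scale / (shape + 1), so this
    # choice of scale gives every component the same mode (an assumption
    # made here to keep the mixture unimodal).
    return [mode * (a + 1.0) for a in shapes]

def responsibilities(s2, d, shapes, mode, weights):
    """Posterior probability of each mixture component given one s2."""
    scales = scales_for_mode(shapes, mode)
    logs = [(math.log(w) if w > 0 else -math.inf) + log_marginal(s2, d, a, b)
            for w, a, b in zip(weights, shapes, scales)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def fit_weights(s2_values, d, shapes, mode, n_iter=200):
    """EM updates for the mixture weights, with shapes and mode held fixed."""
    weights = [1.0 / len(shapes)] * len(shapes)
    for _ in range(n_iter):
        totals = [0.0] * len(shapes)
        for s2 in s2_values:
            resp = responsibilities(s2, d, shapes, mode, weights)
            for k, r in enumerate(resp):
                totals[k] += r
        weights = [t / len(s2_values) for t in totals]
    return weights

def posterior_mean(s2, d, shapes, mode, weights):
    """Posterior mean of sigma^2; requires shape + d/2 > 1 for every component."""
    h = d / 2.0
    scales = scales_for_mode(shapes, mode)
    resp = responsibilities(s2, d, shapes, mode, weights)
    # Each component's posterior is InvGamma(shape + d/2, scale + d*s2/2).
    return sum(r * (b + h * s2) / (a + h - 1.0)
               for r, a, b in zip(resp, shapes, scales))

if __name__ == "__main__":
    observed = [0.6, 0.8, 1.1, 1.3, 0.9, 7.5]  # made-up sample variances
    shapes = [1.5, 4.0, 16.0]                  # hypothetical shape grid
    w = fit_weights(observed, d=4, shapes=shapes, mode=1.0)
    for s2 in observed:
        print(s2, round(posterior_mean(s2, 4, shapes, 1.0, w), 3))
```

In this sketch, small shape values give heavy-tailed components, which is how a mixture can accommodate "outlying" genes that a single inverse-gamma prior would shrink too strongly. A fuller treatment would also estimate the mode from the data rather than fixing it.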
We show that this more flexible approach performs competitively with existing methods when the variances truly come from an inverse-gamma distribution, and can outperform them when the distribution of the variances is more complex. In analyses of several human gene expression datasets from the Genotype-Tissue Expression (GTEx) consortium, we find that our more flexible model often fits the data appreciably better than the single inverse-gamma distribution. At the same time, we find that in these data the improved model fit leads to only small improvements in variance estimates and in detection of differentially expressed genes.
Our methods are implemented in an R package, vashr, available from http://github.com/mengyin/vashr
Contact: mstephens@uchicago.edu
Supplementary information: Supplementary data are available at Bioinformatics online.