Biostatistics Division, HRB Clinical Research Facility, National University of Ireland Galway, Galway, Ireland.
Department of Statistics and Data Science, Yale University, New Haven, CT, USA.
Bioinformatics. 2020 Jan 1;36(1):177-185. doi: 10.1093/bioinformatics/btz471.
In bioinformatics, genome-wide experiments look for important biological differences between two groups at a large number of locations in the genome. Often, the final analysis focuses on a P-value-based ranking of locations, which might then be investigated further in follow-up experiments. However, this strategy may rank small effects that have low P-values more favorably than larger, more scientifically important effects. Bayesian ranking techniques may offer a solution to this problem, provided a good prior for the distribution of effect sizes across locations is available.
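The ranking problem described above can be seen in a two-location toy example. The sketch below assumes each location has an effect estimate with a known standard error and a two-sided normal (z-test) P-value. The location names and numbers are made up for illustration and do not come from the paper.

```python
import math

def two_sided_p(effect, se):
    """Two-sided P-value for a z-test of effect = 0 (normal approximation)."""
    z = abs(effect) / se
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Hypothetical locations: name -> (effect estimate, standard error).
locations = {
    "A": (0.1, 0.02),  # small effect, measured very precisely
    "B": (2.0, 0.6),   # large effect, measured less precisely
}

p_values = {name: two_sided_p(e, s) for name, (e, s) in locations.items()}

# Smallest P-value first.
rank_by_p = sorted(locations, key=lambda name: p_values[name])
# Largest absolute effect estimate first.
rank_by_effect = sorted(locations, key=lambda name: -abs(locations[name][0]))
```

Here location A has the smaller P-value and so heads the P-value ranking, even though the estimated effect at B is twenty times larger.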
We develop an empirical Bayes ranking algorithm that uses the marginal distribution of the data over all locations to estimate an appropriate prior. In simulations and analyses of real datasets, we demonstrate favorable performance compared with ranking by P-values and with a number of other competing ranking methods. The algorithm is computationally efficient and can be used to rank all genomic locations or, in a two-stage analysis, to rank a subset of locations pre-selected via traditional FWER/FDR methods.
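The general idea of estimating a prior from the marginal distribution and then ranking on posterior quantities can be sketched with a deliberately simple stand-in model. This is not the algorithm implemented in EBrank. It assumes each estimate is normal around its true effect with a known standard error, and that true effects follow a zero-mean normal prior whose variance is estimated by the method of moments. All function names and data are hypothetical.

```python
def estimate_prior_variance(estimates, std_errors):
    """Method-of-moments estimate of the prior variance tau^2.

    Under the assumed model, E[x_i^2] = tau^2 + s_i^2 marginally, so the
    average of x_i^2 - s_i^2 over all locations estimates tau^2.
    """
    excess = [x * x - s * s for x, s in zip(estimates, std_errors)]
    return max(0.0, sum(excess) / len(excess))

def posterior_means(estimates, std_errors):
    """Shrink each estimate toward zero; noisier estimates shrink more."""
    tau2 = estimate_prior_variance(estimates, std_errors)
    return [x * tau2 / (tau2 + s * s) for x, s in zip(estimates, std_errors)]

def eb_rank(estimates, std_errors):
    """Indices of locations, largest absolute posterior mean first."""
    post = posterior_means(estimates, std_errors)
    return sorted(range(len(post)), key=lambda i: -abs(post[i]))

# Made-up effect estimates and standard errors for five locations.
estimates = [0.1, 2.0, 3.0, -0.2, 0.05]
std_errors = [0.02, 0.6, 2.5, 0.1, 0.5]

ranking = eb_rank(estimates, std_errors)
```

In this toy setting the location with the largest raw estimate (index 2) is also the noisiest, so it is shrunk heavily and ranked below the more precisely measured large effect (index 1). The precisely measured small effect (index 0), which would head a P-value ranking, falls to fourth place.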
An R package, EBrank, implementing the ranking algorithm, is available on CRAN.
Supplementary data are available at Bioinformatics online.