Institute for Immunity, Transplantation and Infection, School of Medicine, Stanford University, Stanford, California, United States of America.
Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, California, United States of America.
PLoS Comput Biol. 2022 Jun 27;18(6):e1010260. doi: 10.1371/journal.pcbi.1010260. eCollection 2022 Jun.
A major limitation of gene expression biomarker studies is that they are not reproducible: the biomarkers they identify do not generalize to larger, heterogeneous, real-world populations. Frequentist multi-cohort gene expression meta-analysis has frequently been used to address this problem and to identify biomarkers that are truly differentially expressed. However, the frequentist meta-analysis framework has limitations: it needs at least 4-5 datasets with hundreds of samples, is prone to confounding from outliers, and relies on multiple-hypothesis-corrected p-values. To address these shortcomings, we have created a Bayesian meta-analysis framework for the analysis of gene expression data. Using real-world data from three different diseases, we show that the Bayesian method is more robust to outliers, produces more informative estimates of between-study heterogeneity, reduces the number of false positive and false negative biomarkers, and selects more generalizable biomarkers with less data. We compared the Bayesian framework to a previously published frequentist framework and developed a publicly available R package.
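To make the frequentist baseline concrete, the following is a minimal sketch of random-effects pooling for a single gene using the standard DerSimonian-Laird estimator. This is an assumption chosen for illustration: the abstract does not state which estimator the published frequentist framework uses, and the function name, inputs, and example numbers are hypothetical. It shows how one outlying cohort shifts the pooled effect and inflates the between-study heterogeneity estimate (tau squared).

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with a DerSimonian-Laird random-effects model.

    effects   : per-study effect sizes for one gene (e.g. standardized mean differences)
    variances : within-study variances of those effect sizes
    Returns (pooled_effect, standard_error, tau2).
    Illustrative helper only; not taken from the paper or its R package.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q measures how far the studies scatter around the fixed estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to every study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    se = math.sqrt(1.0 / sw_star)
    return pooled, se, tau2


# Hypothetical effect sizes: three concordant cohorts, then the same plus one outlier.
concordant = dersimonian_laird([0.5, 0.5, 0.5], [0.1, 0.2, 0.4])
with_outlier = dersimonian_laird([0.5, 0.5, 0.5, 3.0], [0.1, 0.1, 0.1, 0.1])
```

With the concordant cohorts the heterogeneity estimate is zero and the pooled effect equals the shared value; adding the single outlying cohort pulls the pooled effect upward and yields a large tau squared, which is the sensitivity to outliers that the abstract attributes to the frequentist approach.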