Rashid Naim U, Li Quefeng, Yeh Jen Jen, Ibrahim Joseph G
Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.
Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.
J Am Stat Assoc. 2020;115(531):1125-1138. doi: 10.1080/01621459.2019.1671197. Epub 2019 Oct 29.
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and to aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable, which limits their generalizability and clinical applicability. To improve replicability, we introduce a novel approach that selects, from multiple datasets, gene signatures whose effects are consistently non-zero while accounting for between-study heterogeneity. We build our model on rank-based quantities, which facilitates integration across different genomic datasets. A high-dimensional penalized generalized linear mixed model (pGLMM) is used to select gene signatures and address data heterogeneity. We compare our method to commonly used strategies that select gene signatures while ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies.
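The abstract does not say which rank-based quantities the model is built on. As a minimal sketch of why such quantities can ease integration across datasets, the example below assumes one simple choice: a binary indicator of whether one gene's expression exceeds another's within the same sample. The gene names, the expression values, and the helper `pairwise_indicators` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_indicators(sample, genes):
    """Map each gene pair (a, b) to 1 if sample[a] > sample[b], else 0.

    Assumed illustration of a rank-based feature; not the authors' definition.
    """
    return {(a, b): int(sample[a] > sample[b]) for a, b in combinations(genes, 2)}


genes = ["G1", "G2", "G3"]

# Hypothetical expression values for one sample on a log-intensity scale.
platform_a = {"G1": 7.2, "G2": 5.1, "G3": 9.8}
# The same sample after a monotone rescaling, standing in for a second platform.
platform_b = {g: 2 ** v for g, v in platform_a.items()}

features_a = pairwise_indicators(platform_a, genes)
features_b = pairwise_indicators(platform_b, genes)

print(features_a)
print(features_a == features_b)
```

Because the indicators depend only on the within-sample ordering of the genes, any monotone change of scale leaves them unchanged, so features computed this way are directly comparable across studies that measure expression on different scales.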