Hieke Stefanie, Benner Axel, Schlenk Richard F, Schumacher Martin, Bullinger Lars, Binder Harald
Institute for Medical Biometry and Statistics, Medical Center - University Freiburg, Freiburg, Germany.
Freiburg Center for Data Analysis and Modeling, University Freiburg, Freiburg, Germany.
PLoS One. 2016 May 9;11(5):e0155226. doi: 10.1371/journal.pone.0155226. eCollection 2016.
Clinical cohorts with time-to-event endpoints are increasingly characterized by measurements of single nucleotide polymorphisms (SNPs), whose number is an order of magnitude larger than the number of measurements typically considered at the gene level. At the same time, clinical cohorts are often still limited in size, which calls for novel analysis strategies to identify potentially prognostic SNPs that can help to better characterize disease processes. We propose such a strategy, drawing on univariate testing ideas from epidemiological case-control studies on the one hand and on multivariable regression techniques developed for gene expression data on the other. In particular, we focus on the stable selection of a small set of SNPs and corresponding genes for subsequent validation. For the univariate analysis, we propose a permutation-based approach for testing at the gene level. For the multivariable analysis, we use regularized regression models that consider all SNPs simultaneously and select a small set of potentially important prognostic SNPs. For both approaches, stability is judged by resampling inclusion frequencies. The overall strategy is illustrated with data from a cohort of acute myeloid leukemia patients and explored in a simulation study. The multivariable approach automatically focuses on a smaller set of SNPs than the univariate approach, roughly in line with blocks of correlated SNPs. This more targeted extraction of SNPs results in more stable selection at both the SNP and the gene level. Thus, within the proposed analysis strategy for SNP data in clinical cohorts, the multivariable regression approach with resampling highlights what regularized regression techniques can add compared to univariate analyses.
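The idea of judging stability by resampling inclusion frequencies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the resampling scheme, the subsample fraction, or any threshold, so the values 0.632 and 0.8 below are assumptions. The selector `top_k_by_correlation` and the simulated continuous outcome are hypothetical stand-ins; the paper works with time-to-event endpoints and regularized multivariable regression, which would take the place of the selector here.

```python
import numpy as np

def inclusion_frequencies(X, y, select, n_resamples=100, frac=0.632, seed=0):
    """Fraction of subsamples in which each column (SNP) of X is selected.

    `select(X_sub, y_sub)` must return the indices of the selected columns.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(round(frac * n))  # subsample size (assumed fraction)
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=m, replace=False)  # draw without replacement
        counts[select(X[idx], y[idx])] += 1
    return counts / n_resamples

def top_k_by_correlation(X, y, k=5):
    """Hypothetical stand-in selector: the k columns with largest |correlation|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant columns get a score of zero
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[-k:]

# Simulated data (assumption): 120 patients, 200 SNPs coded 0/1/2,
# with only the first two SNPs affecting a continuous outcome.
rng = np.random.default_rng(1)
n, p = 120, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)
y = 1.5 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

freq = inclusion_frequencies(X, y, top_k_by_correlation)
stable = np.flatnonzero(freq >= 0.8)  # assumed stability threshold
```

Because `select` is passed in as a function, the same loop can wrap either a univariate ranking or a multivariable model, which allows both approaches to be compared on the same scale of inclusion frequencies.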