Hieke Stefanie, Benner Axel, Schlenk Richard F, Schumacher Martin, Bullinger Lars, Binder Harald
Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Stefan-Meier-Str. 26, Freiburg, 79104, Germany.
Freiburg Center for Data Analysis and Modeling, University Freiburg, Eckerstr. 1, Freiburg, 79104, Germany.
BMC Bioinformatics. 2016 Aug 30;17(1):327. doi: 10.1186/s12859-016-1183-6.
High-throughput technology allows for genome-wide measurements at different molecular levels for the same patient, e.g. single nucleotide polymorphisms (SNPs) and gene expression. Correspondingly, it might be beneficial to also integrate complementary information from different molecular levels when building multivariable risk prediction models for a clinical endpoint, such as treatment response or survival. Unfortunately, such a high-dimensional modeling task will often be complicated by a limited overlap of molecular measurements at different levels between patients, i.e. measurements from all molecular levels are available only for a smaller proportion of patients.
We propose a sequential strategy for building clinical risk prediction models that integrate genome-wide measurements from two molecular levels in a complementary way. To deal with partial overlap, we develop an imputation approach that allows us to use all available data. This approach is investigated in two acute myeloid leukemia applications combining gene expression with either SNP or DNA methylation data. First, a sparse risk prediction signature is obtained from one molecular level by componentwise likelihood-based boosting, e.g. an automatically selected set of prognostic SNPs from SNP data. Imputation is then performed for the corresponding linear predictor by a linking model that incorporates measurements from the other level, e.g. gene expression. The imputed linear predictor is then used for adjustment when building a prognostic signature from the gene expression data. For evaluation, we consider stability, as quantified by inclusion frequencies across resampling data sets. Despite an extremely small overlap in the application example with gene expression and SNPs, several genes are identified more stably when the (imputed) linear predictor from the SNP data is taken into account. In the application with gene expression and DNA methylation, prediction performance with respect to survival also indicates that the proposed approach might work well.
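The three steps described above (sparse signature from the first level, imputation of its linear predictor through a linking model, adjusted signature from the second level) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it substitutes componentwise L2 boosting on a continuous outcome for the likelihood-based boosting of a survival model, uses ridge regression as a stand-in linking model, and runs on simulated, centered data. The function names (`componentwise_boost`, `ridge_fit`), the tuning values, the patient counts, and the choice to keep the observed linear predictor for overlap patients are all assumptions made for illustration.

```python
import numpy as np

def componentwise_boost(X, y, offset=None, steps=50, nu=0.1):
    """Componentwise L2 boosting (stand-in for likelihood-based boosting).
    Each step updates only the coefficient that best fits the residuals."""
    n, p = X.shape
    beta = np.zeros(p)
    eta = np.zeros(n) if offset is None else np.array(offset, dtype=float)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(steps):
        resid = y - eta
        coef = X.T @ resid / col_ss      # univariate least-squares fits
        gain = coef ** 2 * col_ss        # drop in residual sum of squares
        j = int(np.argmax(gain))
        beta[j] += nu * coef[j]          # small step on the best covariate
        eta += nu * coef[j] * X[:, j]
    return beta

def ridge_fit(X, y, lam=1.0):
    """Ridge regression coefficients (hypothetical linking model)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data; all sizes and effects are made up for illustration.
rng = np.random.default_rng(0)
n, p_snp, p_expr = 200, 30, 40
S = rng.binomial(2, 0.3, size=(n, p_snp)).astype(float)
S -= S.mean(axis=0)                      # centered SNP codes
G = rng.normal(size=(n, p_expr))         # gene expression
G[:, 1] += S[:, 0]                       # one gene partly reflects a SNP
y = S[:, 0] + 1.5 * G[:, 0] + rng.normal(size=n)
y -= y.mean()

# Partial overlap: only 40 patients have both molecular levels.
has_snp = np.arange(n) < 120
has_expr = np.arange(n) >= 80
overlap = has_snp & has_expr

# Step 1: sparse signature from the SNP data.
beta_snp = componentwise_boost(S[has_snp], y[has_snp])
lp = np.full(n, np.nan)
lp[has_snp] = S[has_snp] @ beta_snp      # observed linear predictor

# Step 2: linking model fitted on the overlap, then imputation.
w = ridge_fit(G[overlap], lp[overlap])
missing = has_expr & ~has_snp
lp[missing] = G[missing] @ w             # imputed linear predictor

# Step 3: expression signature, adjusted for the SNP linear predictor.
beta_expr = componentwise_boost(G[has_expr], y[has_expr], offset=lp[has_expr])

print("selected SNPs:", np.flatnonzero(beta_snp))
print("selected genes:", np.flatnonzero(beta_expr))
```

Passing the linear predictor as a fixed offset means the second boosting run can only pick up expression effects not already explained by the SNP signature, which is one way to read "complementary" in the text. Repeating the three steps on resampled data sets and counting how often each gene is selected would give inclusion frequencies of the kind used for evaluation.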
We consider imputation of linear predictor values to be a feasible and sensible approach for dealing with partial overlap in complementary integrative analysis of molecular measurements at different levels. More generally, these results indicate that a complementary strategy for integrating different molecular levels can result in more stable risk prediction signatures, potentially providing a more reliable insight into the underlying biology.