Weyer Veronika, Binder Harald
Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center Mainz, Johannes Gutenberg-University Mainz, Obere Zahlbacher Strasse 69, Mainz, Germany.
BMC Bioinformatics. 2015 Sep 15;16:294. doi: 10.1186/s12859-015-0716-8.
High-dimensional molecular measurements, e.g. gene expression data, can be linked to clinical time-to-event endpoints by Cox regression models. Regularized estimation approaches, such as componentwise boosting, can incorporate a large number of covariates and also provide variable selection. If there is heterogeneity due to known patient subgroups, a stratified Cox model allows for separate baseline hazards in each subgroup. Variable selection will still depend on the relative stratum sizes in the data, which might be a convenience sample and not representative of future applications. Such effects need to be systematically investigated, and doing so could even help to identify components of risk prediction signatures more reliably.
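For reference, the stratified Cox model referred to above is commonly written as follows. The notation is an assumption made for illustration and is not taken from the source: $s$ indexes the known subgroups, $h_{0s}$ is the stratum-specific baseline hazard, $x_i$ is the covariate vector of patient $i$, and $\beta$ is a coefficient vector shared across strata.

```latex
h(t \mid x_i,\, s_i = s) \;=\; h_{0s}(t)\,\exp\!\left(x_i^\top \beta\right), \qquad s = 1, \dots, S
```

Because $\beta$ is shared while only the baseline hazards differ, larger strata contribute more terms to the partial likelihood, which is one way to see why variable selection depends on the relative stratum sizes.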
Correspondingly, we propose a weighted regression approach based on componentwise likelihood-based boosting which is implemented in the R package CoxBoost (https://github.com/binderh/CoxBoost). This approach focuses on building a risk prediction signature for a specific stratum by down-weighting the observations from the other strata using a range of weights. Stability of selection for specific covariates as a function of the weights is investigated by resampling inclusion frequencies, and two types of corresponding visualizations are suggested. This is illustrated for two applications with methylation and gene expression measurements from cancer patients.
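The down-weighting and resampling described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the CoxBoost implementation: all function names are hypothetical, the selection rule is a single score-statistic step at beta = 0 rather than a full boosting fit, the weighted stratified partial likelihood is the standard weighted form, and resampling uses bootstrap draws.

```python
import numpy as np

def stratum_weights(strata, target, w):
    """Weight 1 for the stratum of interest, w (0 <= w <= 1) for all others."""
    return np.where(np.asarray(strata) == target, 1.0, float(w))

def score_statistics(x, time, status, strata, weights, penalty=0.0):
    """Per-covariate score statistics at beta = 0 (hypothetical helper).

    Assumes the weighted stratified partial log-likelihood
    sum_i w_i * d_i * (x_i'b - log sum_{j in R_s(t_i)} w_j * exp(x_j'b)),
    where the risk set R_s(t_i) is restricted to the stratum of observation i.
    """
    x = np.asarray(x, dtype=float)
    time, status = np.asarray(time, dtype=float), np.asarray(status)
    strata, weights = np.asarray(strata), np.asarray(weights, dtype=float)
    score = np.zeros(x.shape[1])
    info = np.zeros(x.shape[1])
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        for i in idx[status[idx] == 1]:           # events in this stratum
            if weights[i] == 0:
                continue                          # zero weight: no contribution
            risk = idx[time[idx] >= time[i]]      # stratum-specific risk set
            wr = weights[risk]
            mean = wr @ x[risk] / wr.sum()
            second = wr @ x[risk] ** 2 / wr.sum()
            score += weights[i] * (x[i] - mean)
            info += weights[i] * (second - mean ** 2)
    denom = info + penalty
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, score ** 2 / safe, 0.0)

def inclusion_frequencies(x, time, status, strata, target, weight_grid,
                          n_resamples=50, n_select=1, seed=0):
    """Fraction of resamples in which each covariate is selected, per weight."""
    x, time = np.asarray(x, dtype=float), np.asarray(time, dtype=float)
    status, strata = np.asarray(status), np.asarray(strata)
    rng = np.random.default_rng(seed)
    n, p = x.shape
    freq = np.zeros((len(weight_grid), p))
    for _ in range(n_resamples):
        draw = rng.choice(n, size=n, replace=True)   # bootstrap draw (assumption)
        for g, w in enumerate(weight_grid):          # same draw for every weight
            wts = stratum_weights(strata[draw], target, w)
            stat = score_statistics(x[draw], time[draw], status[draw],
                                    strata[draw], wts)
            top = np.argsort(stat)[::-1][:n_select]  # highest-scoring covariates
            freq[g, top] += 1
    return freq / n_resamples
```

A weight of 0 reduces the analysis to the stratum of interest alone, while a weight of 1 corresponds to an ordinary stratified analysis of all patients. Plotting each row of the returned array against the weight grid gives one possible view of how selection stability changes with the weight.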
The proposed approach is intended to identify components of risk prediction signatures that are specific to the stratum of interest, as well as components that are also important to other strata. Performance is mostly improved by incorporating down-weighted information from the other strata. This suggests more general usefulness for risk prediction signature development in data with heterogeneity due to known subgroups.