Sariyar Murat, Schumacher Martin, Binder Harald
Stat Appl Genet Mol Biol. 2014 Jun;13(3):343-57. doi: 10.1515/sagmb-2013-0050.
Risk prediction models can link high-dimensional molecular measurements, such as DNA methylation, to clinical endpoints. For biological interpretation, a sparse fit is often desirable. Different molecular aggregation levels, such as DNA methylation considered at the CpG, gene, or chromosome level, may demand different degrees of sparsity. Hence, model building and estimation techniques should be able to adapt their sparsity to the setting. In addition, underestimation of coefficients, a typical problem of sparse techniques, should be addressed. We propose a comprehensive approach based on a boosting technique that allows flexible adaptation of model sparsity and addresses these problems in an integrative way. The main motivation is automatic sparsity adaptation. In a simulation study, we show that this approach reduces underestimation in sparse settings and selects more adequate model sizes than the corresponding non-adaptive boosting technique in non-sparse settings. Using different aggregation levels of DNA methylation data from a study of kidney carcinoma patients, we illustrate how automatically selected values of the sparsity tuning parameter can reflect the underlying structure of the data. In addition, prediction performance and variable selection stability are compared with those of the non-adaptive boosting approach.
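To make the link between boosting, sparsity, and coefficient underestimation concrete, the following is a minimal sketch of generic componentwise L2 boosting for a linear model. It is not the adaptive procedure proposed in the paper. The function name `componentwise_l2_boost`, the step-size factor `nu`, the number of steps `n_steps`, and the simulated data are all assumptions made for illustration. Each step updates only one coefficient, so stopping early yields a sparse fit whose coefficients are shrunken toward zero.

```python
import numpy as np


def componentwise_l2_boost(X, y, n_steps=50, nu=0.1):
    """Generic componentwise L2 boosting (illustrative, not the paper's method)."""
    n, p = X.shape
    beta = np.zeros(p)
    offset = y.mean()
    resid = y - offset
    col_ss = (X ** 2).sum(axis=0)  # sum of squares of each covariate
    for _ in range(n_steps):
        # Least-squares coefficient of each single covariate on the residuals
        cand = X.T @ resid / col_ss
        # Select the covariate whose univariate fit reduces the RSS the most
        j = np.argmax(cand ** 2 * col_ss)
        # Take only a fraction nu of that fit, so estimates grow slowly
        beta[j] += nu * cand[j]
        resid -= nu * cand[j] * X[:, j]
    return offset, beta


# Hypothetical sparse setting: 3 informative covariates out of 200
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize columns
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + rng.standard_normal(n)

_, beta_few = componentwise_l2_boost(X, y, n_steps=20)
_, beta_many = componentwise_l2_boost(X, y, n_steps=500)
print("non-zero coefficients:",
      np.count_nonzero(beta_few), np.count_nonzero(beta_many))
```

In this sketch the number of boosting steps is the only sparsity control: few steps give a small model with strongly shrunken coefficients, many steps give a larger model. An adaptive approach in the sense of the abstract would additionally tune how updates are distributed across covariates, which this sketch does not attempt.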