Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany.
Biostatistics. 2021 Apr 10;22(2):348-364. doi: 10.1093/biostatistics/kxz034.
Penalization schemes such as Lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. The relative strength of penalization across predictors is decisive, yet it is often glossed over and determined only implicitly by the scale of the individual predictors. At the same time, additional information on the predictors is available in many applications but is left unused. Here, we propose to use such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian toolset, our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it improves prediction performance in situations where the groups differ strongly in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability and can improve prediction performance.
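To make the idea of differential penalization concrete, the following is a minimal sketch of ridge regression with one penalty per feature group. It is not the method proposed in the paper: the penalties here are fixed by hand rather than adapted to the data, there is no feature selection, and the function name `group_ridge`, the simulated data and the penalty values are all assumptions made for illustration.

```python
import numpy as np


def group_ridge(X, y, groups, penalties):
    """Closed-form ridge regression with one penalty per feature group.

    Illustrative only: `penalties` maps each group label to a hand-picked
    penalty; the paper's method estimates the relative penalization from data.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Look up the penalty of each column from its group label.
    lam = np.array([penalties[g] for g in np.asarray(groups).tolist()], dtype=float)
    # Solve (X'X + diag(lam)) beta = X'y.
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)


# Simulated data (assumption): group 0 is informative, group 1 is pure noise.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
groups = np.repeat([0, 1], p // 2)
beta_true = np.where(groups == 0, 1.0, 0.0)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Same penalty for both groups versus a much stronger penalty on group 1.
beta_equal = group_ridge(X, y, groups, {0: 10.0, 1: 10.0})
beta_diff = group_ridge(X, y, groups, {0: 0.1, 1: 1000.0})

print("group 1 coefficient norm, equal penalties:", np.linalg.norm(beta_equal[groups == 1]))
print("group 1 coefficient norm, group penalties:", np.linalg.norm(beta_diff[groups == 1]))
```

Penalizing the uninformative group more strongly shrinks its coefficients toward zero while leaving the informative group nearly unpenalized. The abstract's contribution is to learn this relative penalization per group from the data instead of fixing it in advance.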