Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Québec, Canada.
Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA, United States.
Genet Epidemiol. 2021 Dec;45(8):874-890. doi: 10.1002/gepi.22430. Epub 2021 Sep 1.
Medical research increasingly relies on high-dimensional regression modeling, which creates a need for error-in-variables methods. The Convex Conditioned Lasso (CoCoLasso) uses a reformulated Lasso objective function and an error-corrected cross-validation to enable error-in-variables regression, but it is computationally expensive. Here, we develop a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm for modeling high-dimensional data that are only partially corrupted by measurement error. To reduce computational cost, this algorithm iteratively optimizes the estimates for the uncorrupted and corrupted features separately, using a specially calibrated formulation of cross-validation error. Through simulations, we show that the BDCoCoLasso algorithm copes with much larger feature sets than CoCoLasso and, as expected, outperforms the naïve Lasso in estimation accuracy and consistency as the intensity and complexity of measurement errors increase. We also add a smoothly clipped absolute deviation (SCAD) penalization option that may be appropriate for some data sets. We apply the BDCoCoLasso algorithm to data selected from the UK Biobank, where we develop and showcase the utility of covariate-adjusted genetic risk scores for body mass index, bone mineral density, and lifespan. We demonstrate that, by leveraging more information than the naïve Lasso in partially corrupted data, the BDCoCoLasso may achieve higher prediction accuracy. These innovations, together with an R package, BDCoCoLasso, make error-in-variables adjustments more accessible for high-dimensional data sets. We posit that the BDCoCoLasso algorithm has the potential to be widely applied in various fields, including genomics-facilitated personalized medicine research.
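The alternating scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the BDCoCoLasso R package's implementation, and every name in it (`corrected_moments`, `update_block`, `block_descent_lasso`, `tau`, `lam`) is hypothetical. It assumes additive measurement error with a known standard deviation `tau` on the corrupted features, corrects the Gram matrix by subtracting `tau**2` from the corresponding diagonal entries, and then alternates Lasso coordinate updates between the uncorrupted and corrupted blocks. To keep the objective convex, it clips negative eigenvalues of the corrected matrix, which is a simpler stand-in for the nearest-positive-semidefinite projection used by CoCoLasso. The calibrated cross-validation and the SCAD penalty are not shown.

```python
import numpy as np


def soft_threshold(x, lam):
    """Scalar soft-thresholding operator used by Lasso coordinate updates."""
    return np.sign(x) * max(abs(x) - lam, 0.0)


def corrected_moments(X1, Z2, y, tau):
    """Build error-corrected second moments.

    X1: clean features, Z2: features observed with additive noise of
    standard deviation tau (assumed known here; an assumption of this sketch).
    """
    n = len(y)
    W = np.hstack([X1, Z2])
    sigma = W.T @ W / n
    idx = np.arange(X1.shape[1], W.shape[1])
    sigma[idx, idx] -= tau ** 2  # remove noise variance from the corrupted block's diagonal
    # Simplification: eigenvalue clipping instead of CoCoLasso's max-norm projection.
    vals, vecs = np.linalg.eigh(sigma)
    sigma = (vecs * np.clip(vals, 1e-6, None)) @ vecs.T
    rho = W.T @ y / n
    return sigma, rho


def update_block(beta, block, sigma, rho, lam, sweeps=50):
    """Coordinate descent over one block, holding the other block fixed."""
    for _ in range(sweeps):
        for j in block:
            # Partial residual correlation, excluding coordinate j's own contribution.
            r = rho[j] - sigma[j] @ beta + sigma[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / sigma[j, j]
    return beta


def block_descent_lasso(X1, Z2, y, tau, lam, outer=100, tol=1e-8):
    """Alternate between the uncorrupted and corrupted blocks until the estimates settle."""
    sigma, rho = corrected_moments(X1, Z2, y, tau)
    p1, p = X1.shape[1], sigma.shape[0]
    blocks = [range(p1), range(p1, p)]
    beta = np.zeros(p)
    for _ in range(outer):
        previous = beta.copy()
        for block in blocks:
            update_block(beta, block, sigma, rho, lam)
        if np.max(np.abs(beta - previous)) < tol:
            break
    return beta
```

A call such as `block_descent_lasso(X1, Z2, y, tau=0.3, lam=0.05)` returns one coefficient vector, with the uncorrupted features first and the corrupted features second. In practice `tau` and `lam` would need to be estimated or tuned, which this sketch leaves out.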