Department of Mathematics, Rowan University, Glassboro, NJ, 08028, USA.
BMC Bioinformatics. 2021 Mar 25;22(1):155. doi: 10.1186/s12859-021-04053-3.
Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By suitably weighting several competing statistical models, model averaging aims to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance the accuracy and stability of prediction in high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging.
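The two stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how candidate models are formed, so the nested models ranked by absolute LASSO coefficient, the penalty `lam`, `group_size`, `max_models`, and all function names are hypothetical choices. The LASSO is a plain coordinate-descent solver, and the simplex-constrained quadratic program is solved by projected gradient descent rather than a dedicated QP solver.

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Z b||^2 + lam * ||b||_1."""
    n, p = Z.shape
    beta, resid = np.zeros(p), y.copy()
    col_sq = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = Z[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= Z[:, j] * (new - beta[j])
                beta[j] = new
    return beta

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def jackknife_weights(E, n_iter=5000):
    """Minimise w' (E'E/n) w over the simplex by projected gradient descent."""
    G = E.T @ E / E.shape[0]
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1] + 1e-12)  # 1 / Lipschitz constant
    w = np.full(E.shape[1], 1.0 / E.shape[1])
    for _ in range(n_iter):
        w = project_simplex(w - step * 2.0 * G @ w)
    return w

def two_stage_average(X, y, lam=0.1, group_size=3, max_models=6):
    n = X.shape[0]
    # Stage 1: LASSO screening on standardised predictors (assumes no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    beta = lasso_cd(Z, y - y.mean(), lam)
    kept = np.flatnonzero(beta)
    kept = kept[np.argsort(-np.abs(beta[kept]))]
    # Hypothetical candidate set: nested low-dimensional models of growing size.
    top = min(len(kept), group_size * max_models, n - 2)
    models = [kept[:s] for s in range(group_size, top + 1, group_size)] or [kept]
    # Stage 2: leave-one-out residuals of each OLS fit, then the weights.
    E, coefs = np.empty((n, len(models))), []
    for m, idx in enumerate(models):
        D = np.column_stack([np.ones(n), X[:, idx]])
        b, *_ = np.linalg.lstsq(D, y, rcond=None)
        Q, _ = np.linalg.qr(D)
        h = (Q ** 2).sum(axis=1)            # diagonal of the hat matrix
        E[:, m] = (y - D @ b) / (1.0 - h)   # exact leave-one-out residuals for OLS
        coefs.append(b)
    w = jackknife_weights(E)

    def predict(X_new):
        preds = [np.column_stack([np.ones(len(X_new)), X_new[:, idx]]) @ b
                 for idx, b in zip(models, coefs)]
        return np.column_stack(preds) @ w

    return models, w, E, predict

# Synthetic demo (made-up data): 40 subjects, 200 predictors, 5 true signals.
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 2.0, 1.5, -1.5]
y = X @ beta_true + rng.standard_normal(n)
models, w, E, predict = two_stage_average(X, y)
print("model sizes:", [len(m) for m in models])
print("weights:", np.round(w, 3))
```

Every candidate model is fitted by ordinary least squares on only a handful of screened predictors, so the leave-one-out residuals come from a single fit per model through the hat-matrix diagonal instead of n refits.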
In simulation studies, the proposed technique outperforms commonly used alternative methods in the high-dimensional regression setting in terms of mean squared prediction error. We apply the proposed method to a riboflavin dataset; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects.
Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254-65, 2014), the proposed approach enjoys three appealing features and thus has better predictive performance: (1) More suitable methods are applied for model construction and weighting. (2) Computational flexibility is retained, since each candidate model and its corresponding weight are determined in a low-dimensional setting and quadratic programming is used in the cross-validation. (3) Model selection and averaging are combined in the procedure, so it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help practical researchers analyze genetic data in medical research.
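To make the quadratic program in feature (2) concrete, the usual jackknife (leave-one-out) model averaging criterion can be written as below. The notation is an assumption introduced here for illustration and does not come from the abstract: $M$ candidate models fitted by least squares on $n$ subjects, $\hat e_i^{(m)}$ the ordinary residual and $h_{ii}^{(m)}$ the hat-matrix diagonal of model $m$ for subject $i$.

```latex
\tilde e_i^{(m)} = \frac{\hat e_i^{(m)}}{1 - h_{ii}^{(m)}}, \qquad
\tilde E = \bigl(\tilde e^{(1)}, \dots, \tilde e^{(M)}\bigr) \in \mathbb{R}^{n \times M},
\qquad
\hat w = \arg\min_{w \in \Delta_M} \; \frac{1}{n}\, w^{\top} \tilde E^{\top} \tilde E\, w,
\qquad
\Delta_M = \Bigl\{ w \in \mathbb{R}^{M} : w_m \ge 0,\ \sum_{m=1}^{M} w_m = 1 \Bigr\}.
```

The objective is a convex quadratic in only $M$ unknowns with linear constraints, so its size depends on the number of candidate models rather than on the number of genes.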