Yang Xingli, Wang Yu, Yan Wennan, Li Jihong
School of Mathematical Sciences, Shanxi University, Taiyuan, People's Republic of China.
School of Modern Educational Technology, Shanxi University, Taiyuan, People's Republic of China.
J Appl Stat. 2020 Jun 18;48(11):1934-1947. doi: 10.1080/02664763.2020.1780571. eCollection 2021.
In high-dimensional linear regression, the number of variables exceeds the sample size. In this setting, the traditional variance estimator based on ordinary least squares is heavily biased, even under a sparsity assumption. A major reason is the high spurious correlation between the unobserved realized noise and some of the predictors. To alleviate this problem, a refitted cross-validation (RCV) method has been proposed in the literature. However, for a complicated model and finite samples, the model selected by RCV has a lower probability of including the true model, which can easily lead to a large bias in the variance estimate. This study therefore proposes a model selection method based on the ranks of the frequencies with which variables occur in the six votes of a blocked 3×2 cross-validation. In practice, the proposed method has a considerably larger probability of including the true model than the RCV method. The variance estimate obtained from the model selected by the proposed method also shows a lower bias and a smaller variance. Furthermore, theoretical analysis establishes the asymptotic normality of the proposed variance estimator.
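The two ideas in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: the selector `screen` is a simple marginal-correlation screen used as a stand-in for whatever selection procedure the paper uses, the model size `k` is fixed by hand, and the blocked 3×2 cross-validation is taken to mean four data blocks paired into halves in the three possible ways. All function names are hypothetical. How the paper turns the voted model into its final variance estimate is not described in the abstract, so the sketch stops at the voted model.

```python
import numpy as np

def screen(X, y, k):
    """Stand-in selector (assumption): the k columns whose score, proportional
    to the absolute marginal correlation with y, is largest."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.sort(np.argsort(score)[-k:])

def ols_variance(X, y, support):
    """Residual variance of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, support]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid / (len(y) - A.shape[1])

def rcv_variance(X, y, k, rng):
    """Refitted cross-validation: select on one half, refit on the other,
    then average the two residual-variance estimates."""
    idx = rng.permutation(len(y))
    h1, h2 = idx[: len(y) // 2], idx[len(y) // 2:]
    s1 = screen(X[h1], y[h1], k)
    s2 = screen(X[h2], y[h2], k)
    return 0.5 * (ols_variance(X[h2], y[h2], s1) + ols_variance(X[h1], y[h1], s2))

def blocked_3x2_votes(X, y, k, rng):
    """Count how often each column is selected across the six halves obtained
    by pairing four blocks in the three possible ways (assumed layout)."""
    blocks = np.array_split(rng.permutation(len(y)), 4)
    votes = np.zeros(X.shape[1], dtype=int)
    for a, b in [(0, 1), (0, 2), (0, 3)]:
        rest = tuple(i for i in range(4) if i not in (a, b))
        for pair in ((a, b), rest):
            half = np.concatenate([blocks[i] for i in pair])
            votes[screen(X[half], y[half], k)] += 1
    return votes

def voted_model(votes, k):
    """The k columns ranked highest by vote count (ties broken by column index)."""
    return np.sort(np.argsort(-votes, kind="stable")[:k])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k = 200, 1000, 5
    X = rng.standard_normal((n, p))
    y = 3.0 * X[:, :3].sum(axis=1) + rng.standard_normal(n)  # true noise variance is 1
    print("RCV variance estimate:", rcv_variance(X, y, k, rng))
    votes = blocked_3x2_votes(X, y, k, rng)
    print("voted model:", voted_model(votes, k))
```

Each of the six halves casts `k` votes, so a variable that truly belongs in the model tends to collect close to six votes, while a spuriously correlated variable is unlikely to be selected repeatedly across different halves. Ranking by vote count is what the sketch uses to mirror the frequency-based selection described above.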