Janson Lucas, Barber Rina Foygel, Candès Emmanuel
J R Stat Soc Series B Stat Methodol. 2017 Sep;79(4):1037-1065. doi: 10.1111/rssb.12203. Epub 2016 Sep 16.
Consider the following three important problems in statistical inference, namely, constructing confidence intervals for (1) the error of a high-dimensional (p > n) regression estimator, (2) the linear regression noise level, and (3) the genetic signal-to-noise ratio of a continuous-valued trait (related to the heritability). All three problems turn out to be closely related to the little-studied problem of performing inference on the ℓ2-norm of the signal in high-dimensional linear regression. We derive a novel procedure for this, which is asymptotically correct when the covariates are multivariate Gaussian and produces valid confidence intervals in finite samples as well. The procedure, called EigenPrism, is computationally fast and makes no assumptions on coefficient sparsity or knowledge of the noise level. We investigate the width of the EigenPrism confidence intervals, including a comparison with a Bayesian setting in which our interval is just 5% wider than the Bayes credible interval. We are then able to unify the three aforementioned problems by showing that the EigenPrism procedure with only minor modifications is able to make important contributions to all three. We also investigate the robustness of coverage and find that the method applies in practice and in finite samples much more widely than just the case of multivariate Gaussian covariates. Finally, we apply EigenPrism to a genetic dataset to estimate the genetic signal-to-noise ratio for a number of continuous phenotypes.