Zhao Bingxin, Zheng Shurong, Zhu Hongtu
Department of Statistics and Data Science, University of Pennsylvania.
School of Mathematics and Statistics, Northeast Normal University.
Ann Stat. 2024 Jun;52(3):948-965. doi: 10.1214/24-aos2378. Epub 2024 Aug 11.
Genetic prediction holds immense promise for translating genetic discoveries into medical advances. As the high-dimensional covariance matrix (or the linkage disequilibrium (LD) pattern) of genetic variants often presents a block-diagonal structure, numerous methods account for the dependence among variants in predetermined local LD blocks. Moreover, due to privacy considerations and data protection concerns, genetic variant dependence in each LD block is typically estimated from external reference panels rather than the original training data set. This paper presents a unified analysis of blockwise and reference panel-based estimators in a high-dimensional prediction framework without sparsity restrictions. We find that, surprisingly, even when the covariance matrix has a block-diagonal structure with well-defined boundaries, blockwise estimation methods adjusting for local dependence can be substantially less accurate than methods controlling for the whole covariance matrix. Further, estimation methods built on the original training data set and on external reference panels are likely to have varying performance in high dimensions, which may reflect the cost of having access only to summary-level data from the training data set. This analysis is based on novel results in random matrix theory for block-diagonal covariance matrices. We numerically evaluate our results using extensive simulations and real data analysis in the UK Biobank.
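To make the three kinds of estimator contrasted in the abstract concrete, the following is a minimal simulation sketch, not the authors' method. The abstract does not specify the estimators or the simulation design, so everything here is an assumption made for illustration: the ridge-type form `(cov_hat + lam * I)^{-1} * marginal`, the equicorrelated blocks, the sample sizes, and the helper names `block_mask`, `simulate_genotypes`, and `ridge_from_summary`. It builds the estimator from (a) the full training sample covariance, (b) the training covariance restricted to the known blocks, and (c) a reference-panel covariance restricted to the same blocks, then reports out-of-sample prediction correlation for each.

```python
import numpy as np


def block_mask(p, block_size):
    """Boolean p x p mask that is True inside the diagonal blocks."""
    block_id = np.arange(p) // block_size
    return block_id[:, None] == block_id[None, :]


def simulate_genotypes(rng, n, p, block_size, rho):
    """Draw n samples whose covariance is block-diagonal and equicorrelated.

    Assumes p is a multiple of block_size (illustrative simplification).
    """
    block = np.full((block_size, block_size), rho)
    np.fill_diagonal(block, 1.0)
    chol = np.linalg.cholesky(block)
    z = rng.standard_normal((n, p // block_size, block_size))
    return (z @ chol.T).reshape(n, p)


def ridge_from_summary(cov_hat, marginal, lam):
    """Ridge-type estimator (cov_hat + lam * I)^{-1} marginal."""
    p = cov_hat.shape[0]
    return np.linalg.solve(cov_hat + lam * np.eye(p), marginal)


# All settings below are hypothetical and chosen only to keep the run small.
rng = np.random.default_rng(0)
n, n_ref, n_test, p, block_size = 400, 400, 2000, 200, 20
rho, lam = 0.5, 1.0

# Dense (non-sparse) effects, matching the "no sparsity restrictions" setting.
beta = rng.standard_normal(p) / np.sqrt(p)

X = simulate_genotypes(rng, n, p, block_size, rho)            # training data
X_ref = simulate_genotypes(rng, n_ref, p, block_size, rho)    # reference panel
X_test = simulate_genotypes(rng, n_test, p, block_size, rho)  # held-out data
y = X @ beta + rng.standard_normal(n)
y_test = X_test @ beta + rng.standard_normal(n_test)

# Summary-level data: marginal variant-trait associations from training data.
marginal = X.T @ y / n

mask = block_mask(p, block_size)
cov_train = X.T @ X / n
cov_ref = X_ref.T @ X_ref / n_ref

estimators = {
    "full covariance, training data": ridge_from_summary(cov_train, marginal, lam),
    "blockwise, training data": ridge_from_summary(cov_train * mask, marginal, lam),
    "blockwise, reference panel": ridge_from_summary(cov_ref * mask, marginal, lam),
}

# Out-of-sample prediction accuracy as the correlation with held-out traits.
accuracy = {
    name: float(np.corrcoef(X_test @ b, y_test)[0, 1])
    for name, b in estimators.items()
}

for name, r in accuracy.items():
    print(f"{name}: out-of-sample correlation = {r:.3f}")
```

The numbers from a single small run like this are not expected to reproduce the paper's high-dimensional conclusions. The sketch only shows where the three estimators differ: in which covariance estimate is inverted, while the marginal summary statistics always come from the training data.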