Zhou Yi-Hui, Marron J S, Wright Fred A
Bioinformatics Research Center and Department of Biological Sciences, North Carolina State University, North Carolina, U.S.A.
Department of Statistics and Operations Research, University of North Carolina, North Carolina, U.S.A.
Biometrics. 2018 Jun;74(2):439-447. doi: 10.1111/biom.12767. Epub 2017 Aug 29.
Genotype eigenvectors are widely used as covariates to control for spurious stratification in genetic association studies. Significance testing for the accompanying eigenvalues has typically been based on a standard Tracy-Widom limiting distribution for the largest eigenvalue, derived under white-noise assumptions. Modest local correlation among markers is known to inflate the largest eigenvalues, even in the absence of true stratification. In addition, a few sample eigenvalues may be extreme, further complicating accurate testing. We explore several methods to identify appropriate null eigenvalue thresholds, while remaining sensitive to eigenvalues corresponding to population stratification. We introduce a novel block permutation approach, designed to produce an appropriate null eigenvalue distribution by eliminating long-range genomic correlation while preserving local correlation. We also propose a fast approach based on eigenvalue distribution modeling, using a simple fit criterion and the general Marčenko-Pastur equation under a simple discrete eigenvalue model. Block permutation and the model-based approach work well for pure simulations and for data resampled from the 1000 Genomes project. In contrast, we find that the standard approach of computing an "effective" number of markers does not perform well. The performance of the methods is also demonstrated for a motivating example from the International Cystic Fibrosis Consortium.
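The block permutation idea described in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption for illustration only: the function names (`top_eigenvalue`, `block_permute`, `null_top_eigenvalues`), the fixed `block_size` in markers, the column standardization, and the number of permutations are hypothetical choices, not the authors' procedure. The sketch shuffles sample labels independently within each contiguous block of markers, which keeps correlation inside a block intact while breaking correlation between distant blocks, and then collects the largest eigenvalue from each permuted matrix as a null reference.

```python
import numpy as np

def top_eigenvalue(G):
    """Largest eigenvalue of Z Z^T / p, where Z is the column-standardized
    n_samples x n_markers genotype matrix (an assumed normalization)."""
    Z = G - G.mean(axis=0)
    sd = Z.std(axis=0)
    sd[sd == 0] = 1.0              # guard against monomorphic markers
    Z = Z / sd
    p = Z.shape[1]
    s = np.linalg.svd(Z, compute_uv=False)
    return (s[0] ** 2) / p         # squared top singular value, scaled by p

def block_permute(G, block_size, rng):
    """Permute sample rows independently within each contiguous marker block."""
    n, p = G.shape
    out = np.empty_like(G)
    for start in range(0, p, block_size):
        stop = min(start + block_size, p)
        perm = rng.permutation(n)  # one shared permutation per block
        out[:, start:stop] = G[perm, start:stop]
    return out

def null_top_eigenvalues(G, block_size=50, n_perm=200, seed=0):
    """Top eigenvalue from each of n_perm block-permuted copies of G."""
    rng = np.random.default_rng(seed)
    return np.array([
        top_eigenvalue(block_permute(G, block_size, rng))
        for _ in range(n_perm)
    ])
```

Under these assumptions, an observed top eigenvalue could be compared with an upper quantile of the returned array, for example `np.quantile(null_top_eigenvalues(G), 0.95)`. How blocks should be sized relative to the extent of local correlation is not stated in the abstract and is left open here.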