Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China.
School of Data Science, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China.
Biometrics. 2023 Jun;79(2):891-902. doi: 10.1111/biom.13691. Epub 2022 May 19.
Inference of population structure from genetic data plays an important role in population and medical genetics studies. As sequencing technology advances and its cost falls, increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method for computing and selecting the top principal components (PCs) that capture population structure was originally developed for array-based genotype data, and it may not perform well on sequencing data for two reasons. First, in sequencing data the number of genetic variants p is much larger than the sample size n, so the sample-to-marker ratio n/p is nearly zero, which violates the assumption of the Tracy-Widom test used in that method. Second, the traditional method may not handle the linkage disequilibrium in sequencing data well. To resolve these two practical issues, we propose a new method called ERStruct to determine the number of top informative PCs based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets, from the HapMap 3 and the 1000 Genomes Projects, demonstrate the empirical performance of our ERStruct method.
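To make the eigenvalue-ratio idea in the abstract concrete, the following Python sketch computes ratios of consecutive eigenvalues from a simulated genotype matrix with p much larger than n. It is only an illustration under stated assumptions, not the ERStruct implementation: the simulation design, the function names `eigenvalue_ratios` and `pick_num_pcs`, and the `max_k` parameter are all hypothetical, and a crude "smallest ratio" rule stands in for the paper's test against a null distribution derived from random matrix theory, which is not reproduced here.

```python
import numpy as np


def eigenvalue_ratios(genotypes):
    """Return descending eigenvalues and consecutive ratios l_{k+1} / l_k."""
    # Center and scale each marker (column); drop monomorphic markers.
    centered = genotypes - genotypes.mean(axis=0)
    sd = centered.std(axis=0)
    keep = sd > 0
    scaled = centered[:, keep] / sd[keep]
    # With p >> n, the n x n Gram matrix has the same non-zero
    # eigenvalues as the p x p covariance and is far cheaper to decompose.
    gram = scaled @ scaled.T / keep.sum()
    eigvals = np.linalg.eigvalsh(gram)[::-1]  # descending order
    eigvals = eigvals[:-1]  # centering forces one eigenvalue to ~0
    ratios = eigvals[1:] / eigvals[:-1]
    return eigvals, ratios


def pick_num_pcs(ratios, max_k=10):
    """Crude heuristic (an assumption, not the paper's test):
    the largest relative drop among the first max_k ratios marks
    the boundary between informative PCs and the noise bulk."""
    return int(np.argmin(ratios[:max_k])) + 1


# Toy simulation (hypothetical design): 3 populations, 50 samples each.
rng = np.random.default_rng(0)
n_pops, n_per_pop, n_markers = 3, 50, 5000
ancestral = rng.uniform(0.1, 0.9, size=n_markers)
drift = rng.normal(0.0, 0.1, size=(n_pops, n_markers))
freqs = np.clip(ancestral + drift, 0.05, 0.95)
labels = np.repeat(np.arange(n_pops), n_per_pop)
genotypes = rng.binomial(2, freqs[labels]).astype(float)

eigvals, ratios = eigenvalue_ratios(genotypes)
n_pcs = pick_num_pcs(ratios)
print("first five ratios:", np.round(ratios[:5], 3))
print("estimated number of informative PCs:", n_pcs)
```

In this toy setup, three well-separated populations should yield two informative PCs after centering, so the sharp drop is expected between the second and third eigenvalues. The sketch also ignores linkage disequilibrium, since the simulated markers are independent.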