NOAA Fisheries, Northwest Fisheries Science Center, Seattle, WA, USA.
Department of Biology, Section for Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark.
Mol Ecol Resour. 2022 Feb;22(2):503-518. doi: 10.1111/1755-0998.13482. Epub 2021 Sep 7.
In genomic-scale data sets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df') compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here, we measured pseudoreplication (quantified by the ratio df'/df) for a common metric of genetic differentiation (F_ST) and a common measure of linkage disequilibrium between pairs of loci (r^2). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df' and df'/df by measuring the rate of decline in the variance of mean F_ST and mean r^2 as more loci were used. For both indices, df' increases with N_e and genome size, as expected. However, even for large N_e and large genomes, df' for mean r^2 plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for F_ST, but df'/df ≤ 0.01 can occur in data sets using tens of thousands of loci. Commonly used block-jackknife methods consistently overestimated var(F_ST), producing very conservative confidence intervals. Predicting df' based on our modelling results as a function of N_e, L, S, and genome size provides a robust way to quantify precision associated with genomic-scale data sets.
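The idea of estimating an effective number of independent loci from how quickly the variance of a mean declines can be illustrated with a toy model. The sketch below is not the authors' estimator and does not use SLiM or msprime: it assumes equicorrelated, unit-variance per-locus values (pairwise correlation `rho`, a hypothetical parameter) and treats the ratio of single-locus variance to the variance of the across-locus mean as a stand-in for df'. All function names are illustrative.

```python
import random
import statistics


def simulate_replicate_means(n_loci, rho, n_reps, rng):
    """Return the across-locus mean for each of n_reps simulated replicates.

    Toy assumption: every locus has unit variance and every pair of loci
    shares the same correlation rho (0 <= rho <= 1).
    """
    means = []
    for _ in range(n_reps):
        shared = rng.gauss(0.0, 1.0)  # component common to all loci
        total = 0.0
        for _ in range(n_loci):
            unique = rng.gauss(0.0, 1.0)  # locus-specific component
            total += (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * unique
        means.append(total / n_loci)
    return means


def effective_loci(n_loci, rho, n_reps=1000, seed=1):
    """Estimate the effective number of independent loci by simulation.

    With unit single-locus variance, independent loci would give
    var(mean) = 1 / n_loci, so 1 / var(mean) is the number of independent
    loci that would produce the observed variance.
    """
    rng = random.Random(seed)
    means = simulate_replicate_means(n_loci, rho, n_reps, rng)
    return 1.0 / statistics.variance(means)


def expected_effective_loci(n_loci, rho):
    """Closed-form value for the equicorrelated toy model."""
    return n_loci / (1.0 + (n_loci - 1) * rho)


if __name__ == "__main__":
    rho = 0.01  # hypothetical pairwise correlation among loci
    for n_loci in (10, 100, 1000):
        est = effective_loci(n_loci, rho)
        exp = expected_effective_loci(n_loci, rho)
        print(f"L={n_loci:5d}  simulated={est:8.1f}  "
              f"expected={exp:8.1f}  ratio={exp / n_loci:.3f}")
```

In this toy model the effective number of loci approaches 1/rho as loci are added, so the ratio of effective to nominal loci keeps shrinking. That mirrors, in a much simplified form, the kind of plateau the abstract reports for df'.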