Department of Ecology and Environmental Sciences, Umeå University, Umeå, Sweden.
Department of Computational Biology, Cornell University, Ithaca, New York, USA.
Mol Ecol Resour. 2023 Oct;23(7):1589-1603. doi: 10.1111/1755-0998.13825. Epub 2023 Jun 20.
The distribution of fitness effects (DFE) of new mutations has been of interest to evolutionary biologists since the concept of mutations arose. Modern population genomic data enable us to quantify the DFE empirically, but few studies have examined how data processing, sample size and cryptic population structure might affect the accuracy of DFE inference. We used simulated and empirical data (from Arabidopsis lyrata) to show the effects of missing-data filtering, sample size, number of single nucleotide polymorphisms (SNPs) and population structure on the accuracy and variance of DFE estimates. Our analyses focus on three filtering methods (downsampling, imputation and subsampling) with sample sizes of 4-100 individuals. We show that (1) the choice of missing-data treatment directly affects the estimated DFE, with downsampling performing better than imputation and subsampling; (2) the estimated DFE is less reliable in small samples (<8 individuals), and becomes unpredictable with too few SNPs (<5000, the sum of 0-fold and 4-fold SNPs); and (3) population structure may skew the inferred DFE towards more strongly deleterious mutations. We suggest that future studies should consider downsampling for small data sets, and use samples of more than 4 (ideally more than 8) individuals, with more than 5000 SNPs, in order to improve the robustness of DFE inference and enable comparative analyses.
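To make the idea of downsampling concrete, the following is a minimal sketch of one common way to handle missing data when building a site frequency spectrum (SFS): each site is projected down to a fixed number of alleles using hypergeometric sampling probabilities, and sites with fewer called alleles than the target are dropped. This is an illustration under stated assumptions, not the pipeline used in the study. The function names `project_site` and `downsample_sfs`, the `(derived, called)` input format and the example counts are all hypothetical.

```python
from math import comb


def project_site(derived, called, target):
    """Probability of observing j = 0..target derived alleles when `target`
    alleles are drawn without replacement from `called` alleles, of which
    `derived` carry the derived state (hypergeometric sampling)."""
    if target > called:
        raise ValueError("target must not exceed the number of called alleles")
    total = comb(called, target)
    # math.comb returns 0 when k > n, so impossible outcomes get probability 0.
    return [
        comb(derived, j) * comb(called - derived, target - j) / total
        for j in range(target + 1)
    ]


def downsample_sfs(sites, target):
    """Build an expected SFS with `target` alleles per site.

    `sites` is an iterable of (derived, called) pairs (assumed input format).
    Sites with fewer than `target` called alleles are skipped.
    """
    sfs = [0.0] * (target + 1)
    for derived, called in sites:
        if called < target:
            continue  # too much missing data at this site
        for j, p in enumerate(project_site(derived, called, target)):
            sfs[j] += p
    return sfs


if __name__ == "__main__":
    # Hypothetical sites: (derived allele count, number of called alleles).
    example_sites = [(1, 4), (2, 2), (1, 1)]
    print(downsample_sfs(example_sites, target=2))
```

In this sketch the entries of the resulting SFS sum to the number of retained sites, so the same check can be used to count how many 0-fold and 4-fold SNPs survive filtering before comparing against a threshold such as the 5000 SNPs suggested above.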