Geibel Johannes, Reimer Christian, Pook Torsten, Weigend Steffen, Weigend Annett, Simianer Henner
Department of Animal Sciences, Animal Breeding and Genetics Group, University of Goettingen, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany.
Center for Integrated Breeding Research, University of Goettingen, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany.
BMC Genomics. 2021 May 12;22(1):340. doi: 10.1186/s12864-021-07663-6.
Population genetic studies based on genotyped single nucleotide polymorphisms (SNPs) are influenced by the non-random selection of the SNPs included on genotyping arrays. The resulting bias in estimates of allele frequency spectra and of population genetic parameters such as heterozygosity and genetic distances, relative to whole genome sequencing (WGS) data, is known as SNP ascertainment bias. Full correction for this bias requires detailed knowledge of the array design process, which is often not available in practice. This study suggests an alternative approach that mitigates ascertainment bias in a large set of genotyped individuals by imputing from a small set of sequenced individuals, without the need for prior knowledge of the array design.
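To make the mechanism concrete, the following toy simulation illustrates how SNP discovery in a small panel can inflate heterozygosity estimates. It is a minimal sketch and not the study's simulation design: the Beta-distributed allele frequencies, the discovery panel size, and the function names are all assumptions chosen for illustration.

```python
import random


def expected_heterozygosity(freqs):
    """Mean expected heterozygosity 2p(1-p) over biallelic loci."""
    return sum(2 * p * (1 - p) for p in freqs) / len(freqs)


def simulate_ascertainment(n_loci=20000, discovery_size=4, seed=1):
    """Compare heterozygosity at all loci with that at 'array' loci.

    A locus enters the hypothetical array only if it is polymorphic
    among `discovery_size` diploid discovery individuals.
    """
    rng = random.Random(seed)
    # Assumed true allele frequencies, skewed towards rare variants.
    true_freqs = [rng.betavariate(0.5, 2.0) for _ in range(n_loci)]

    ascertained = []
    n_chromosomes = 2 * discovery_size
    for p in true_freqs:
        # Count alternative alleles among the sampled chromosomes.
        count = sum(rng.random() < p for _ in range(n_chromosomes))
        if 0 < count < n_chromosomes:  # polymorphic in the discovery panel
            ascertained.append(p)

    return expected_heterozygosity(true_freqs), expected_heterozygosity(ascertained)


if __name__ == "__main__":
    he_all, he_array = simulate_ascertainment()
    print(f"all loci: {he_all:.3f}, ascertained loci: {he_array:.3f}")
```

Because rare variants are unlikely to be polymorphic in a small discovery panel, the ascertained loci are enriched for intermediate allele frequencies, so heterozygosity computed from them exceeds the value over all loci.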
The strategy was first tested by simulating additional ascertainment bias in a set of 1566 chickens from 74 populations that were genotyped for the positions of the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array. Imputation accuracy was consistently higher for populations used for SNP discovery during the simulated array design process. Reference sets containing at least one individual per population in the study set strongly corrected ascertainment bias in estimates of expected and observed heterozygosity, Wright's Fixation Index and Nei's Standard Genetic Distance. In contrast, unbalanced reference sets (in which some populations were overrepresented compared to the study set) introduced a new bias towards the reference populations. Finally, the array genotypes were imputed to WGS level using reference sets ranging from 74 individuals (one per population) to 98 individuals (with additional commercial chickens), and the results were compared with WGS data from a mixture of individually sequenced and pool-sequenced populations. Imputation reduced the slope between heterozygosity estimates of array data and WGS data from 1.94 to 1.26 when using the smaller balanced reference panel, and to 1.44 when using the larger but unbalanced reference panel. These results generally supported those from the simulation but were less favorable, which argues for a larger reference panel when imputing to WGS.
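The slope comparison can be sketched as an ordinary least-squares regression of per-population heterozygosity estimates from one data source on those from the other. The direction of the regression (array on WGS) and the helper name `ols_slope` are assumptions, since the abstract does not state how the slope was computed, and the example values are hypothetical, not data from the study.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical per-population heterozygosity estimates.
    he_wgs = [0.10, 0.15, 0.20, 0.25]
    he_array = [0.20, 0.29, 0.41, 0.50]
    print(f"slope: {ols_slope(he_wgs, he_array):.2f}")
```

Under this reading, a slope of 1 would mean that differences between populations are estimated on the same scale in both data sources, so a reduction from 1.94 towards 1 indicates that imputation brought the array-based estimates closer to the WGS-based ones.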
The results highlight the potential of using imputation for mitigation of SNP ascertainment bias but also underline the need for unbiased reference sets.