School of Biological Sciences, University of Western Australia, Crawley, Western Australia, Australia.
Mol Ecol Resour. 2024 May;24(4):e13947. doi: 10.1111/1755-0998.13947. Epub 2024 Mar 3.
Genetic diversity is frequently described using heterozygosity, particularly in a conservation context. Often, it is estimated using single nucleotide polymorphisms (SNPs); however, it has been shown that heterozygosity values calculated from SNPs can be biased by both study design and filtering parameters. Though solutions have been proposed to address these issues, our own work has found them to be inadequate in some circumstances. Here, we aimed to improve the reliability and comparability of heterozygosity estimates, specifically by investigating how sample size and missing data thresholds influenced the calculation of autosomal heterozygosity (heterozygosity calculated from across the genome, i.e. fixed and variable sites). We also explored how the standard practice of tri- and tetra-allelic site exclusion could bias heterozygosity estimates and influence eventual conclusions relating to genetic diversity. Across three distinct taxa (a frog, Litoria rubella; a tree, Eucalyptus microcarpa; and a grasshopper, Keyacris scurra), we found heterozygosity estimates to be meaningfully affected by sample size and missing data thresholds, partly due to the exclusion of tri- and tetra-allelic sites. These biases were inconsistent both between species and populations, with more diverse populations tending to have their estimates more severely affected, thus having potential to dramatically alter interpretations of genetic diversity. We propose a modified framework for calculating heterozygosity that reduces bias and improves the utility of heterozygosity as a measure of genetic diversity, whilst also highlighting the need for existing population genetic pipelines to be adjusted such that tri- and tetra-allelic sites be included in calculations.
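To make the effect of site exclusion concrete, the following minimal Python sketch computes observed heterozygosity across all called sites (fixed and variable) and then again with multi-allelic sites dropped. It is an illustration under stated assumptions, not the framework the authors propose: the genotype encoding, the function names `n_alleles` and `observed_heterozygosity`, the `max_alleles` parameter, and the toy data are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical encoding: a diploid genotype is a pair of allele indices;
# None marks a missing call.
Genotype = Optional[Tuple[int, int]]


def n_alleles(site: Sequence[Genotype]) -> int:
    """Number of distinct alleles observed among the called genotypes at a site."""
    return len({allele for g in site if g is not None for allele in g})


def observed_heterozygosity(
    sites: Sequence[Sequence[Genotype]],
    max_alleles: Optional[int] = None,
) -> float:
    """Proportion of called genotypes that are heterozygous.

    Fixed (monomorphic) sites stay in the denominator. If max_alleles is
    given, sites with more distinct alleles than that are skipped, which
    mimics the exclusion of tri- and tetra-allelic sites.
    """
    het = 0
    called = 0
    for site in sites:
        if max_alleles is not None and n_alleles(site) > max_alleles:
            continue
        for g in site:
            if g is None:
                continue  # missing calls do not count towards the denominator
            called += 1
            if g[0] != g[1]:
                het += 1
    if called == 0:
        raise ValueError("no called genotypes to average over")
    return het / called


# Toy data (invented for illustration): 4 sites x 3 diploid individuals.
toy_sites = [
    [(0, 0), (0, 0), (0, 0)],  # fixed site
    [(0, 1), (0, 0), None],    # biallelic site with one missing call
    [(0, 1), (1, 2), (0, 2)],  # tri-allelic site, every individual heterozygous
    [(0, 0), (0, 0), (0, 0)],  # fixed site
]

all_sites = observed_heterozygosity(toy_sites)
biallelic_only = observed_heterozygosity(toy_sites, max_alleles=2)
```

In this toy example, dropping the tri-allelic site lowers the estimate from 4/11 to 1/8, because the excluded site contributed three heterozygous calls. That is the direction of bias expected whenever the excluded sites are disproportionately heterozygous.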