Harris Alexandre M, DeGiorgio Michael
Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802.
Molecular, Cellular, and Integrative Biosciences at the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802.
G3 (Bethesda). 2017 Feb 9;7(2):671-691. doi: 10.1534/g3.116.037168.
Gene diversity, or expected heterozygosity (H), is a common statistic for assessing genetic variation within populations. Estimates of this statistic become less accurate and less precise when individuals are related or inbred, because allele copies in the sample are then more dependent on one another. The original unbiased estimator of expected heterozygosity underestimates true population diversity in samples containing relatives, as it accounts only for sample size. More recently, a general unbiased estimator of expected heterozygosity was developed that explicitly accounts for related and inbred individuals in samples. Though unbiased, this estimator has greater variance than the original estimator. To address this issue, we introduce a general unbiased estimator of gene diversity for samples containing related or inbred individuals, which employs the best linear unbiased estimator of allele frequencies rather than the commonly used sample proportion. We examine the properties of this estimator relative to alternative estimators using simulations and theoretical predictions, and show that it has the smallest mean squared error among them in most cases. Further, we empirically assess the performance of the estimator on a global human microsatellite dataset of 5795 individuals, from 267 populations, genotyped at 645 loci. Additionally, we show that the reduced variance of the estimator leads to improved estimates of the population differentiation statistic, F_ST, which employs measures of gene diversity within its calculation. Finally, we provide an R script to compute this estimator from genomic and pedigree data.
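For orientation, the classical sample-size-corrected estimator that the abstract calls the original unbiased estimator can be sketched as H = n/(n-1) * (1 - sum of squared sample allele proportions), where n is the number of sampled allele copies at a locus. The Python sketch below illustrates only that baseline, under the assumption that the sampled allele copies are independent (no relatives, no inbreeding). It is not the estimator introduced in this paper, which replaces the sample proportions with best linear unbiased estimates and requires relatedness information. The function name is hypothetical.

```python
from collections import Counter


def classical_heterozygosity(alleles):
    """Sample-size-corrected expected heterozygosity at one locus.

    `alleles` holds one label per sampled allele copy (for diploids,
    two entries per individual). Assumes the copies are independent,
    i.e. no related or inbred individuals in the sample.
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two allele copies")
    counts = Counter(alleles)
    # Sum of squared sample allele proportions (sample homozygosity).
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    # The n / (n - 1) factor corrects for finite sample size only.
    return n / (n - 1) * (1.0 - sum_sq)


if __name__ == "__main__":
    # Two diploid individuals, genotypes A/A and B/B at one locus.
    print(classical_heterozygosity(["A", "A", "B", "B"]))
```

Because the correction factor depends only on n, this baseline has no way to account for the extra dependence among allele copies that relatives introduce, which is the shortcoming the abstract describes.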