Slater Noa, Louzoun Yoram, Gragert Loren, Maiers Martin, Chatterjee Ansu, Albrecht Mark
Gonda Brain Research Center, Bar-Ilan University, Ramat Gan, Israel.
Gonda Brain Research Center, Bar-Ilan University, Ramat Gan, Israel; Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel.
PLoS Comput Biol. 2015 Apr 22;11(4):e1004204. doi: 10.1371/journal.pcbi.1004204. eCollection 2015 Apr.
Measures of allele and haplotype diversity, which are fundamental properties in population genetics, often follow heavy-tailed distributions. These measures are of particular interest in the field of hematopoietic stem cell transplant (HSCT). Donor/recipient suitability for HSCT is determined by Human Leukocyte Antigen (HLA) similarity. Match predictions rely upon a precise description of HLA diversity, yet classical estimates are inaccurate given the heavy-tailed nature of the distribution. This directly affects HSCT matching and diversity measures in broader fields such as species richness. We have therefore developed a power-law based estimator to measure allele and haplotype diversity that accommodates heavy tails using the concepts of regular variation and occupancy distributions. Application of our estimator to 6.59 million donors in the Be The Match Registry revealed that haplotypes follow a heavy-tailed distribution across all ethnicities: for example, 44.65% of the European American haplotypes are represented by only 1 individual. Indeed, our discovery rate of all U.S. European American haplotypes is estimated at 23.45% based upon sampling 3.97% of the population, leaving a large number of unobserved haplotypes. Population coverage, however, is much higher at 99.4%, given that 90% of European Americans carry one of the 4.5% most frequent haplotypes. Alleles were found to be less diverse, suggesting that the current registry represents most alleles in the population. Thus, for HSCT registries, haplotype discovery will remain high with continued recruitment to a very deep level of sampling, but population coverage will not. Finally, we compared the convergence of our power-law estimator with that of classical diversity estimators such as the capture-recapture, Chao, ACE and Jackknife methods. When fit to the haplotype data, our estimator displayed favorable properties in terms of convergence (with respect to sampling depth) and accuracy (with respect to diversity estimates).
This suggests that power-law based estimators offer a valid alternative to classical diversity estimators and may have broad applicability in the field of population genetics.
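For readers unfamiliar with the classical estimators named above, the following is a minimal sketch of two of them, the bias-corrected Chao1 estimator and the first-order jackknife, computed from the number of types seen exactly once and exactly twice in a sample. It is not the power-law estimator developed in the paper, which the abstract does not specify in enough detail to reproduce. The function names and the toy haplotype labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequency_counts(sample):
    """Map k -> number of distinct types observed exactly k times."""
    per_type = Counter(sample)          # type -> number of observations
    return Counter(per_type.values())   # k -> number of types with count k


def chao1(sample):
    """Bias-corrected Chao1 richness estimate (defined even with no doubletons)."""
    f = frequency_counts(sample)
    s_obs = sum(f.values())             # number of distinct types observed
    f1, f2 = f.get(1, 0), f.get(2, 0)   # singletons and doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def jackknife1(sample):
    """First-order jackknife richness estimate for abundance data."""
    f = frequency_counts(sample)
    n = len(sample)                     # number of individuals sampled
    s_obs = sum(f.values())
    return s_obs + f.get(1, 0) * (n - 1) / n


# Hypothetical toy sample: each entry is the haplotype label of one individual.
toy_sample = ["A", "A", "A", "B", "B", "C", "D"]
print(chao1(toy_sample))       # 4 observed types, 2 singletons, 1 doubleton -> 4.5
print(jackknife1(toy_sample))
```

Both estimators extrapolate from the rare end of the observed frequency spectrum, so when a large share of types are singletons, as the abstract reports for haplotypes, their estimates tend to keep shifting as sampling depth grows. That is the convergence behavior the comparison in the abstract refers to.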