Williams Cole M, Scelza Brooke A, Slack Sarah D, Font-Porterias Neus, Al-Hindi Dana R, Mathias Rasika A, Watson Harold, Barnes Kathleen C, Lange Ethan, Johnson Randi K, Gignoux Christopher R, Ramachandran Sohini, Henn Brenna M
Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA.
Department of Evolution, Ecology, and Organismal Biology, Brown University, Providence, RI 02912, USA.
Genetics. 2025 Aug 6;230(4). doi: 10.1093/genetics/iyaf094.
Accurate reconstruction of pedigrees from genetic data remains a challenging problem. Many relationship categories (e.g., half-sibling vs. avuncular pairs) can be difficult to distinguish without external information. Pedigree inference algorithms are often trained on European-descent families in urban locations. Thus, existing methods tend to perform poorly in endogamous populations, in which pedigrees may contain reticulations and haplotype sharing is elevated. We present a simple, rapid algorithm that initially uses only high-confidence first-degree relationships to seed a machine learning step based on summary statistics of identity-by-descent sharing. One of these statistics, our "haplotype score," is novel and can be used to: (1) distinguish half-sibling pairs from avuncular or grandparent-grandchild pairs; and (2) assign individuals to the ancestor vs. descendant generation. We test our approach in a sample of 700 individuals from the Himba, an endogamous population in northern Namibia. Due to a culture of concurrent relationships among the Himba, there is a high proportion of half-sibships. We accurately identify first- through fourth-degree relationships and distinguish between the second-degree relationships: half-sibling, avuncular, and grandparent-grandchild pairs. We further validate our approach in a second African-descent dataset, the Barbados Asthma Genetics Study, and in a European-descent founder population from Quebec. Accurate reconstruction of relatives facilitates estimating allele frequencies, tracing allele trajectories, improving phasing, estimating heritability, and addressing other population genomic questions.
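To illustrate why second-degree relationships are hard to separate, the following minimal Python sketch assigns a degree of relatedness from the total fraction of the genome shared identical-by-descent, using the standard pedigree expectation that degree-d relatives share about 2^-d of their genome. It is not the authors' algorithm, and it does not implement the haplotype score, which the abstract does not define; the function name `degree_from_sharing`, the table `EXPECTED_SHARING`, and the nearest-power-of-two rule are assumptions made for illustration only.

```python
import math

# Pedigree-expected fraction of the genome shared IBD for common pair types.
# These are textbook expectations for outbred pedigrees (an assumption here);
# endogamy raises observed sharing above these values.
EXPECTED_SHARING = {
    "parent-child": 0.5,
    "full siblings": 0.5,
    "half-siblings": 0.25,
    "avuncular": 0.25,
    "grandparent-grandchild": 0.25,
    "first cousins": 0.125,
}


def degree_from_sharing(r, max_degree=4):
    """Return the degree d whose expectation 2**-d is closest to r on a log2 scale.

    Returns 0 for near-complete sharing (duplicate sample or monozygotic twin)
    and None when the pair shares nothing or is more distant than max_degree.
    """
    if r <= 0:
        return None
    d = max(round(-math.log2(r)), 0)
    return d if d <= max_degree else None


# Every second-degree pair type lands in the same bin, so total sharing
# alone cannot tell them apart.
second_degree = [
    name for name, r in EXPECTED_SHARING.items() if degree_from_sharing(r) == 2
]
print(second_degree)
```

Half-sibling, avuncular, and grandparent-grandchild pairs all fall into the same degree under this rule, which is the ambiguity that additional statistics such as the haplotype score are meant to resolve. Fixed thresholds of this kind would also be expected to overestimate relatedness where background haplotype sharing is elevated, as in endogamous populations.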