Jewett Ethan M
23andMe, Inc., Sunnyvale, CA 94086.
bioRxiv. 2024 Sep 4:2024.05.13.594005. doi: 10.1101/2024.05.13.594005.
The datasets of large genotyping biobanks and direct-to-consumer genetic testing companies contain many related individuals. Until now, it has been widely accepted that the most distant relationships that can be detected are around fifteen degrees (approximately 8th cousins) and that practical relationship estimates have a ceiling around ten degrees (approximately 5th cousins). However, we show that these assumptions are incorrect and that they result from a misapplication of relationship estimators. In particular, relationship estimators are applied almost exclusively to putative relatives who have been identified because they share detectable tracts of DNA identical by descent (IBD). Yet no existing relationship estimator conditions on the event that two individuals share at least one detectable segment of IBD anywhere in the genome. As a result, the relationship estimates obtained using existing estimators are dramatically biased for distant relationships, inferring all sufficiently distant relationships to be around ten degrees regardless of the depth of the true relationship. In addition, existing relationship estimators are derived under a model that assumes that each pair of related individuals shares a single common ancestor (or mating pair of ancestors). This model breaks down for relationships beyond 10 generations in the past because individuals share many thousands of cryptic common ancestors due to pedigree collapse. We first derive a corrected likelihood that conditions on the event that at least one segment is observed between a pair of putative relatives, and we demonstrate that the corrected likelihood largely eliminates the bias in estimates of pairwise relationships and provides a more accurate characterization of the uncertainty in these estimates. We then reformulate the relationship inference problem to account for the fact that individuals share many common ancestors, not just one.
We demonstrate that the most distant relationship that can be inferred using IBD may be 200 degrees or more, rather than ten, extending the time-to-common ancestor from approximately 300 years in the past to approximately 3,000 years in the past or more. This dramatic increase in the range of relationship estimators makes it possible to infer relationships whose common ancestors lived before historical events such as European settlement of the Americas, the Transatlantic Slave Trade, and the rise and fall of the Roman Empire.
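The effect of conditioning on detection can be illustrated with a small toy model. The sketch below is not the likelihood derived in this work; it is a minimal illustration under assumptions introduced here: a single mating pair of common ancestors, a Poisson number of detectable segments whose mean decays with the degree d (counted in meioses), exponentially distributed segment lengths with rate d per Morgan, and a hypothetical 7 cM detection threshold. All constants and function names are illustrative. The conditional likelihood divides the unconditional one by the probability of observing at least one segment, 1 - exp(-lambda(d)).

```python
import math

# Assumed toy parameters, chosen for illustration only (not from the source).
N_ANCESTORS = 2              # a mating pair of common ancestors
RECOMB_PER_MEIOSIS = 35.3    # assumed expected crossovers per genome per meiosis
N_CHROMOSOMES = 22           # autosomes
MIN_LEN = 0.07               # assumed detection threshold in Morgans (7 cM)


def log_expected_segments(d):
    """Log of the expected number of detectable IBD segments at degree d."""
    return (math.log(N_ANCESTORS * (RECOMB_PER_MEIOSIS * d + N_CHROMOSOMES))
            - (d - 1) * math.log(2.0)   # each extra meiosis halves transmission
            - d * MIN_LEN)              # P(exponential length exceeds MIN_LEN)


def log_lik_unconditional(d, lengths):
    """Poisson count term plus truncated-exponential length terms."""
    log_lam = log_expected_segments(d)
    lam = math.exp(log_lam)
    n = len(lengths)
    count_term = n * log_lam - lam - math.lgamma(n + 1)
    length_term = sum(math.log(d) - d * (x - MIN_LEN) for x in lengths)
    return count_term + length_term


def log_lik_conditional(d, lengths):
    """Same likelihood, conditioned on observing at least one segment."""
    lam = math.exp(log_expected_segments(d))
    log_p_detect = math.log(-math.expm1(-lam))  # log(1 - exp(-lam)), stable
    return log_lik_unconditional(d, lengths) - log_p_detect


def mle_degree(log_lik, lengths, max_degree=300):
    """Grid-search maximum-likelihood degree over 1..max_degree."""
    return max(range(1, max_degree + 1), key=lambda d: log_lik(d, lengths))


# One observed segment of 8 cM, just above the assumed threshold.
observed = [0.08]
d_unconditional = mle_degree(log_lik_unconditional, observed)
d_conditional = mle_degree(log_lik_conditional, observed)
print("unconditional estimate:", d_unconditional)
print("conditional estimate:  ", d_conditional)
```

In this toy setting the unconditional estimate is held down by the steep penalty that the Poisson count term places on distant relationships, while the conditional estimate is driven mainly by the segment length. The numbers it prints should be read only as a qualitative illustration of the bias, since the sketch still assumes a single common ancestor pair and ignores the many shared ancestors that the reformulated inference problem accounts for.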