Mehra Somya, Neafsey Daniel E, White Michael, Taylor Aimee R
Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
G3 (Bethesda). 2025 May 8;15(5). doi: 10.1093/g3journal/jkaf018.
Genetic studies of Plasmodium parasites increasingly feature relatedness estimates. However, various aspects of malaria parasite relatedness estimation are not fully understood. For example, relatedness estimates based on whole-genome-sequence (WGS) data often exceed those based on sparser data types. Systematic bias in relatedness estimation is well documented in the literature geared towards diploid organisms, but largely unknown within the malaria community. We characterize systematic bias in malaria parasite relatedness estimation using three complementary approaches: theoretically, under a non-ancestral statistical model of pairwise relatedness; numerically, under a simulation model of ancestry; and empirically, using data on parasites sampled from Guyana and Colombia. We show that allele frequency estimates encode, locus-by-locus, relatedness averaged over the set of sampled parasites used to compute them. Plugging sample allele frequencies into models of pairwise relatedness can lead to systematic underestimation. However, systematic underestimation can be viewed as population-relatedness calibration, i.e., a way of generating measures of relative relatedness. Systematic underestimation is unavoidable when relatedness is estimated assuming independence between genetic markers. It is mitigated when relatedness is estimated using WGS data under a hidden Markov model (HMM) that exploits linkage between proximal markers. The extent of mitigation is unknowable when an HMM is fit to sparser data, but downstream analyses that use high relatedness thresholds are relatively robust regardless. In summary, practitioners can either resolve to use relative relatedness estimated under independence, or try to estimate absolute relatedness under an HMM. We propose various tools to help practitioners evaluate their situation on a case-by-case basis.
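The underestimation that arises from plugging in sample allele frequencies can be illustrated with a toy simulation. The sketch below is not the authors' model or code: it assumes haploid, biallelic, independent loci, a hypothetical "common source" construction in which each parasite copies a shared source allele with probability `s` (so any pair is identical by descent with probability `s**2`), and a simple method-of-moments estimator. All function and variable names are illustrative.

```python
import random


def simulate_sample(n_parasites, n_loci, s, rng):
    """Simulate haploid biallelic genotypes with a shared (hypothetical) source.

    At each locus a parasite copies the source allele with probability s;
    otherwise it draws an allele from the population frequency. Two parasites
    are then identical by descent at a locus with probability s**2.
    """
    pop_freqs = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
    source = [int(rng.random() < p) for p in pop_freqs]
    sample = [
        [
            source[l] if rng.random() < s else int(rng.random() < pop_freqs[l])
            for l in range(n_loci)
        ]
        for _ in range(n_parasites)
    ]
    return pop_freqs, sample


def estimate_relatedness(x, y, freqs):
    """Method-of-moments relatedness estimate assuming independent loci.

    Solves mean(match) = r + (1 - r) * mean(h) for r, where
    h = p**2 + (1 - p)**2 is the chance that two unrelated alleles match.
    The estimate is not truncated, so it can be negative.
    """
    n_loci = len(freqs)
    match = sum(a == b for a, b in zip(x, y)) / n_loci
    h = sum(p * p + (1 - p) * (1 - p) for p in freqs) / n_loci
    return (match - h) / (1 - h)


def mean_pairwise(sample, freqs):
    """Average the pairwise estimate over all distinct pairs in the sample."""
    n = len(sample)
    estimates = [
        estimate_relatedness(sample[i], sample[j], freqs)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(estimates) / len(estimates)


rng = random.Random(1)
s = 0.5  # assumed copying probability; true pairwise relatedness is s**2
pop_freqs, sample = simulate_sample(n_parasites=40, n_loci=1000, s=s, rng=rng)

# Allele frequencies computed from the same related parasites being compared.
sample_freqs = [
    sum(genotype[l] for genotype in sample) / len(sample)
    for l in range(len(pop_freqs))
]

r_pop = mean_pairwise(sample, pop_freqs)        # uses population frequencies
r_sample = mean_pairwise(sample, sample_freqs)  # uses sample frequencies

print(f"true relatedness:              {s ** 2:.3f}")
print(f"mean estimate, population freq: {r_pop:.3f}")
print(f"mean estimate, sample freq:     {r_sample:.3f}")
```

Under these assumptions, the estimates based on population frequencies should centre near `s**2`, whereas those based on sample frequencies should centre near zero: the sample frequencies absorb the relatedness shared across the sample, so the resulting values read as relatedness relative to the sample average rather than absolute relatedness.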