Huang Zhendong, Kelleher Jerome, Chan Yao-Ban, Balding David
Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia.
Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom.
PLoS Genet. 2025 Jan 8;21(1):e1011537. doi: 10.1371/journal.pgen.1011537. eCollection 2025 Jan.
Inference of evolutionary and demographic parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) there is no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced, with little loss of statistical efficiency, by using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that even poorly inferred short IBD segments can improve estimation. Our mutation-rate estimator achieves precision similar to a previously published method despite a 4,000-fold reduction in the data used for inference, and we identify significant differences between human populations. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, and still find improved parameter inference.
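To make the idea of approximate Bayesian computation (ABC) concrete, the following is a minimal rejection-sampling sketch in Python. It is not the algorithm of the paper. The toy model (mutation counts on segments drawn as Poisson with mean rate × length), the uniform prior, the single summary statistic, and all names such as `simulate_summary` and `abc_rejection` are assumptions chosen for illustration only.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_summary(rng, rate, lengths):
    """Toy model (an assumption): mutations on a segment ~ Poisson(rate * length).

    Returns the pooled number of mutations per unit length.
    """
    total = sum(poisson(rng, rate * length) for length in lengths)
    return total / sum(lengths)


def abc_rejection(observed, lengths, prior_low, prior_high,
                  tolerance, n_draws, seed=0):
    """Keep prior draws whose simulated summary lies within `tolerance` of `observed`."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(prior_low, prior_high)  # draw from a uniform prior
        simulated = simulate_summary(rng, rate, lengths)
        if abs(simulated - observed) <= tolerance:
            accepted.append(rate)
    return accepted


# Hypothetical segment lengths in arbitrary units (100 segments in total).
lengths = [2.0, 5.0, 1.0, 8.0, 4.0] * 20
# "Observed" data generated from the toy model with a known rate of 0.5.
observed = simulate_summary(random.Random(1), 0.5, lengths)
accepted = abc_rejection(observed, lengths, prior_low=0.0, prior_high=2.0,
                         tolerance=0.05, n_draws=2000)
posterior_mean = sum(accepted) / len(accepted)
```

The accepted values approximate the posterior distribution of the rate under the toy model. A smaller `tolerance` gives a closer approximation at the cost of fewer accepted draws, which is one way in which computational cost constrains inference of this kind.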