Huang Zhendong, Kelleher Jerome, Chan Yao-Ban, Balding David J
Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia.
Oxford Big Data Institute, University of Oxford, United Kingdom.
bioRxiv. 2024 Mar 13:2024.03.07.583855. doi: 10.1101/2024.03.07.583855.
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) there is no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced, with little loss of statistical efficiency, by using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation (ABC) inference algorithm and use it to show that even poorly inferred short IBD segments can improve estimation precision. We show estimation precision similar to that of a previously published estimator despite a 4,000-fold reduction in the data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification while still finding improved parameter inference.
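The abstract names approximate Bayesian computation without describing the algorithm, so the following is only a generic rejection-ABC sketch and not the authors' method. Everything in it is an assumption made for illustration: the toy simulator (segment lengths drawn from an exponential distribution with mean `theta`), the uniform prior, the choice of the mean segment length as the summary statistic, the tolerance `eps`, and all function names.

```python
import random
import statistics


def simulate_lengths(theta, n, rng):
    """Toy stand-in simulator (assumption): n segment lengths,
    exponentially distributed with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]


def summary(lengths):
    """Summary statistic (assumption): the mean segment length."""
    return statistics.fmean(lengths)


def abc_rejection(observed, n_draws, eps, rng, prior=(0.1, 10.0)):
    """Generic rejection ABC: draw theta from a uniform prior, simulate
    data, and keep theta when the simulated summary falls within eps of
    the observed summary."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(*prior)
        s_sim = summary(simulate_lengths(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= eps:
            accepted.append(theta)
    return accepted


# Hypothetical usage: "observed" data are themselves simulated with theta = 2.0.
rng = random.Random(1)
observed = simulate_lengths(2.0, 200, rng)
posterior = abc_rejection(observed, n_draws=5000, eps=0.1, rng=rng)
print(len(posterior), statistics.fmean(posterior))
```

The accepted values of `theta` form an approximate posterior sample. A smaller `eps` tightens the approximation at the cost of fewer accepted draws, which is one way in which computational cost can constrain this style of inference.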