Martin Simon H
Institute of Ecology and Evolution, School of Biological Sciences, The University of Edinburgh, Edinburgh, EH9 3FL, United Kingdom.
Genetics. 2025 Sep 8. doi: 10.1093/genetics/iyaf181.
Recent advances in methods to infer and analyse ancestral recombination graphs (ARGs) are providing powerful new insights in evolutionary biology and beyond. Existing inference approaches tend to be designed for use with fully phased datasets, and some rely on model assumptions about demography and recombination rate. Here I describe Sequential Tree Inference by Collecting Compatible Sites (sticcs), a simple model-free approach for genealogical inference along the genome from unphased genotype data. sticcs applies a heuristic algorithm based on the perfect phylogeny principle to reconstruct a local genealogy at each variant site in the genome, using a 'collecting' procedure to import information from other nearby sites. Using simulations, I show that sticcs is accurate for ARG inference, but only when the sample size is small. However, I also describe how it can be applied for the purpose of topology weighting by 'stacking' tree sequences inferred for multiple subsets of the data, which removes the sample size restriction. Topology weights estimated in this way from unphased data tend to be more accurate than those computed with full ARGs inferred from perfectly phased data using several popular tools. The methods presented therefore have promise for analysis of relatedness and introgression in non-model species, including polyploids. The new methods are implemented in two Python packages, sticcs (for ARG inference) and twisst2 (for topology weighting using the stacking procedure), both of which interface with the tskit library for analysis of tree sequence objects.
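As a rough illustration of the perfect phylogeny principle mentioned above, the sketch below applies the classical four-gamete test to pairs of biallelic sites and then gathers neighbouring sites that are mutually compatible with a focal site. It is not the sticcs algorithm: it assumes phased 0/1 haplotypes for simplicity (sticcs works from unphased genotypes), and the function names `compatible` and `collect_compatible`, the nearest-first scan order, and the toy data are all hypothetical choices made for this example.

```python
def compatible(site_a, site_b):
    """Four-gamete test for two biallelic sites (assumed phased, coded 0/1).

    Two sites can sit on one tree without recurrent mutation only if
    fewer than four of the allele pairs 00, 01, 10, 11 are observed.
    """
    return len(set(zip(site_a, site_b))) < 4


def collect_compatible(sites, focal):
    """Hypothetical 'collecting' step: scan outward from the focal site and
    keep each site that is compatible with every site collected so far."""
    # Visit sites nearest-first; the focal site itself has distance 0.
    order = sorted(range(len(sites)), key=lambda i: abs(i - focal))
    collected = []
    for i in order:
        if all(compatible(sites[i], sites[j]) for j in collected):
            collected.append(i)
    return sorted(collected)


# Toy data: each tuple holds the alleles of four haplotypes at one site.
sites = [
    (0, 0, 1, 1),  # site 0
    (0, 1, 1, 1),  # site 1
    (0, 1, 0, 1),  # site 2: shows all four gametes with site 0
    (1, 0, 0, 0),  # site 3
]

print(collect_compatible(sites, focal=0))
print(collect_compatible(sites, focal=2))
```

Sites 0 and 2 cannot share a single tree under this test, so the set collected around each of them excludes the other; a mutually compatible set of sites is what would then define the clades of a local genealogy.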