Jin Yifan, Terhorst Jonathan
Department of Statistics, University of Michigan, 1085 South University Avenue, Ann Arbor, MI, 48103, USA.
Algorithms Mol Biol. 2023 Aug 9;18(1):12. doi: 10.1186/s13015-023-00237-z.
The Li-Stephens (LS) haplotype copying model forms the basis of a number of important statistical inference procedures in genetics. LS is a probabilistic generative model which supposes that a sampled chromosome is an imperfect mosaic of other chromosomes found in a population. In the frequentist setting which is the focus of this paper, the output of LS is a "copying path" through chromosome space. The behavior of LS depends crucially on two user-specified parameters, θ and ρ, which are respectively interpreted as the rates of mutation and recombination. However, because LS is not based on a realistic model of ancestry, the precise connection between these parameters and the biological phenomena they represent is unclear. Here, we offer an alternative perspective, which considers θ and ρ as tuning parameters, and seeks to understand their impact on the LS output. We derive an algorithm which, for a given dataset, efficiently partitions the (θ, ρ) plane into regions where the output of the algorithm is constant, thereby enumerating all possible solutions to the LS model in one go. We extend this approach to the "diploid LS" model commonly used for phasing. We demonstrate the usefulness of our method by studying the effects of changing θ and ρ when using LS for common bioinformatic tasks. Our findings indicate that using the conventional (i.e., population-scaled) values for θ and ρ produces near optimal results for imputation, but may systematically inflate switch error in the case of phasing diploid genotypes.
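To make the notion of a "copying path" concrete, the following is a minimal sketch of a haploid LS-style Viterbi decoder. It is not the partitioning algorithm of the paper. It assumes constant per-site mutation and recombination rates, in which case the most likely path is the one minimizing a weighted count of mismatches and switches in negative-log space. The names `ls_copying_path`, `mismatch_cost` and `switch_cost` are hypothetical, and the two costs stand in for transformed versions of the mutation and recombination parameters.

```python
from typing import List, Sequence, Tuple


def ls_copying_path(
    target: Sequence[int],
    panel: Sequence[Sequence[int]],
    mismatch_cost: float,
    switch_cost: float,
) -> Tuple[float, List[int]]:
    """Return (cost, path) minimizing
    mismatch_cost * (#mismatches) + switch_cost * (#switches),
    where path[s] is the index of the panel haplotype copied at site s.
    Hypothetical helper for illustration only."""
    n_hap, n_sites = len(panel), len(target)
    # cost[h]: cheapest path that ends on haplotype h at the current site
    cost = [mismatch_cost * (panel[h][0] != target[0]) for h in range(n_hap)]
    back: List[List[int]] = []  # back[s-1][h]: predecessor of h at site s
    for s in range(1, n_sites):
        # Any switch costs the same, so only the cheapest predecessor matters.
        best_prev = min(range(n_hap), key=cost.__getitem__)
        new_cost, ptr = [], []
        for h in range(n_hap):
            stay, switch = cost[h], cost[best_prev] + switch_cost
            prev, base = (h, stay) if stay <= switch else (best_prev, switch)
            new_cost.append(base + mismatch_cost * (panel[h][s] != target[s]))
            ptr.append(prev)
        cost = new_cost
        back.append(ptr)
    # Trace back from the cheapest end state.
    h = min(range(n_hap), key=cost.__getitem__)
    total = cost[h]
    path = [h]
    for ptr in reversed(back):
        h = ptr[h]
        path.append(h)
    path.reverse()
    return total, path


# Toy data (made up for illustration): two reference haplotypes, four sites.
panel = [[0, 0, 0, 0], [1, 1, 1, 1]]
target = [0, 0, 1, 1]
cheap_switch = ls_copying_path(target, panel, mismatch_cost=1.0, switch_cost=1.0)
costly_switch = ls_copying_path(target, panel, mismatch_cost=1.0, switch_cost=5.0)
```

On this toy input, a cheap switch yields a path that changes haplotype once, while an expensive switch yields a path that stays on a single haplotype and absorbs two mismatches. Because the objective in this sketch is linear in the two costs, the optimal path can only change across boundaries in the parameter plane, which gives some intuition for why the plane splits into regions of constant output.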