Department of Mathematical Sciences, Durham University, Stockton Road, Durham, DH1 3LE, UK.
Department of Genetics, Yale School of Medicine, 333 Cedar Street, New Haven, CT, 06520, USA.
BMC Bioinformatics. 2024 Feb 28;25(1):86. doi: 10.1186/s12859-024-05688-8.
Approximating the recent phylogeny of N phased haplotypes at a set of variants along the genome is a core problem in modern population genomics and central to performing genome-wide screens for association, selection, introgression, and other signals. The Li & Stephens (LS) model provides a simple yet powerful hidden Markov model for inferring the recent ancestry at a given variant, represented as a distance matrix based on posterior decodings.
We provide a high-performance engine that makes these posterior decodings readily accessible with minimal pre-processing via an easy-to-use package, kalis, in the statistical programming language R. kalis enables investigators to rapidly resolve the ancestry at loci of interest and developers to build a range of variant-specific ancestral inference pipelines on top. kalis exploits both multi-core parallelism and modern CPU vector instruction sets to enable scaling to hundreds of thousands of genomes.
The resulting distance matrices accessible via kalis enable local ancestry, selection, and association studies in modern large-scale genomic datasets.
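To make the idea of a distance matrix built from posterior decodings concrete, the following is a minimal pure-Python sketch of an LS-style forward-backward pass. It is not the kalis implementation or its API. The function names `ls_posterior` and `ls_distance_matrix`, the constant per-site recombination probability `rho`, the constant mismatch probability `mu`, and the choice of distance as the negative log posterior copying probability are all assumptions made for illustration.

```python
import math


def ls_posterior(haps, recipient, site, rho=0.05, mu=0.01):
    """Posterior probability that `recipient` copies each donor at `site`.

    Illustrative assumptions: constant per-site switch probability `rho`,
    constant mismatch probability `mu`, uniform prior over donors.
    """
    n, n_sites = len(haps), len(haps[0])
    donors = [j for j in range(n) if j != recipient]
    pi = 1.0 / len(donors)  # uniform prior copying probability

    def emit(j, l):
        # Emission: the recipient matches the donor allele with prob. 1 - mu.
        return 1.0 - mu if haps[j][l] == haps[recipient][l] else mu

    def normalise(v):
        # Rescale to sum to 1; this also guards against numerical underflow.
        s = sum(v)
        return [x / s for x in v]

    # Forward pass from the first site up to and including `site`.
    fwd = normalise([pi * emit(j, 0) for j in donors])
    for l in range(1, site + 1):
        total = sum(fwd)
        fwd = normalise([emit(j, l) * ((1 - rho) * f + rho * pi * total)
                         for j, f in zip(donors, fwd)])

    # Backward pass from the last site down to `site`.
    bwd = [1.0] * len(donors)
    for l in range(n_sites - 1, site, -1):
        weighted = [emit(j, l) * b for j, b in zip(donors, bwd)]
        jump = rho * pi * sum(weighted)
        bwd = normalise([(1 - rho) * w + jump for w in weighted])

    post = normalise([f * b for f, b in zip(fwd, bwd)])
    return dict(zip(donors, post))


def ls_distance_matrix(haps, site, rho=0.05, mu=0.01):
    """Row i holds -log posterior copying probabilities for recipient i."""
    n = len(haps)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j, p in ls_posterior(haps, i, site, rho, mu).items():
            dist[i][j] = -math.log(p)
    return dist


# Toy data: four haplotypes (rows) at four biallelic variants (columns).
toy_haps = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]
toy_dist = ls_distance_matrix(toy_haps, site=1)
```

In this toy example, haplotypes 0 and 1 are identical, so the entry `toy_dist[0][1]` is smaller than `toy_dist[0][2]`. The matrix is not symmetric in general, because each row comes from a separate HMM run with a different recipient. The sketch costs O(N² L) time for N haplotypes and L variants and uses none of the parallelism or vectorisation described in the abstract.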