Department of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona 08034, Spain.
Department of Biomedical Data Science, Stanford Medical School.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii27-ii33. doi: 10.1093/bioinformatics/btac464.
Local ancestry inference (LAI) is the high-resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications, including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, and must be re-trained for each new setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications.
We present SALAI-Net, a portable statistical LAI method that can be applied to any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity-by-descent methods, SALAI-Net estimates population labels for each segment of DNA by matching it against reference haplotypes, which leads to an interpretable and fast technique. We benchmark our models on human whole-genome data and test their ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM than competing methods.
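The reference matching idea can be illustrated with a minimal sketch. This is not the SALAI-Net implementation: the function name `window_reference_matching`, the fixed non-overlapping windows, the allele-match similarity and the toy data are all assumptions made for illustration, and the published model may differ substantially. The sketch only shows how a query haplotype can receive one population label per segment by comparing it with labeled reference haplotypes.

```python
import numpy as np


def window_reference_matching(query, references, ref_labels, window_size):
    """Label each window of `query` with the population of its best-matching reference.

    Hypothetical helper for illustration only (not the authors' code).
    query:       1-D array of alleles (0/1), length n_snps
    references:  2-D array of alleles, shape (n_refs, n_snps)
    ref_labels:  population label of each reference haplotype, length n_refs
    window_size: number of SNPs per non-overlapping window (assumed value)
    """
    query = np.asarray(query)
    references = np.asarray(references)
    ref_labels = np.asarray(ref_labels)
    populations = np.unique(ref_labels)

    predictions = []
    for start in range(0, query.shape[0], window_size):
        stop = min(start + window_size, query.shape[0])
        # Fraction of alleles in this window shared with each reference haplotype.
        similarity = (references[:, start:stop] == query[start:stop]).mean(axis=1)
        # Keep the best-matching reference within each population.
        best_per_pop = np.array(
            [similarity[ref_labels == pop].max() for pop in populations]
        )
        # The window takes the label of the population with the closest match.
        predictions.append(populations[best_per_pop.argmax()])
    return np.array(predictions)


# Toy data (invented): two populations, "A" and "B", with two references each.
references = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],  # A
    [0, 0, 0, 1, 0, 0, 1, 0],  # A
    [1, 1, 1, 1, 1, 1, 1, 1],  # B
    [1, 1, 0, 1, 1, 0, 1, 1],  # B
])
ref_labels = np.array(["A", "A", "B", "B"])

# An admixed query: first half resembles population A, second half population B.
query = np.array([0, 0, 0, 0, 1, 1, 1, 1])

predicted = window_reference_matching(query, references, ref_labels, window_size=4)
print(predicted.tolist())  # ['A', 'B']
```

Because each window's label can be traced back to the specific reference haplotype that matched best, an approach of this kind is easy to inspect, which is consistent with the interpretability argument above.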
We provide an open-source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. Data are publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes).
Supplementary data are available from Bioinformatics online.