Hybrid autoencoder with orthogonal latent space for robust population structure inference.

Affiliations

Department of Electrical Engineering, ESAT/PSI, KU Leuven, Leuven, Belgium.

Department of Human Genetics, KU Leuven, Leuven, Belgium.

Publication information

Sci Rep. 2023 Feb 14;13(1):2612. doi: 10.1038/s41598-023-28759-x.

DOI:10.1038/s41598-023-28759-x

PMID:36788253

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9929087/

Abstract

Analysis of population structure and genomic ancestry remains an important topic in human genetics and bioinformatics. Commonly used methods require high-quality genotype data to ensure accurate inference. However, in practice, laboratory artifacts and outliers are often present in the data. Moreover, existing methods are typically affected by the presence of related individuals in the dataset. In this work, we propose a novel hybrid method, called SAE-IBS, which combines the strengths of traditional matrix decomposition-based (e.g., principal component analysis) and more recent neural network-based (e.g., autoencoders) solutions. Namely, it yields an orthogonal latent space, which enhances dimensionality selection, while learning non-linear transformations. The proposed approach achieves higher accuracy than existing methods for projecting poor-quality target samples (those with genotyping errors and missing data) onto a reference ancestry space, and it generates a robust ancestry space in the presence of relatedness. We introduce a new approach and an accompanying open-source program for robust ancestry inference in the presence of missing data, genotyping errors, and relatedness. The obtained ancestry space allows for non-linear projections and exhibits orthogonality with clearly separable population groups.
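To make the idea of "projecting target samples onto a reference ancestry space" concrete, the following is a minimal sketch of the traditional linear baseline the abstract refers to (principal component analysis), not of SAE-IBS itself. The genotype data are synthetic, the function names `fit_reference_space` and `project` are hypothetical, and imputing missing genotypes with the reference mean is an assumption of this sketch rather than anything stated in the source. It also checks the orthogonality property that the abstract highlights for the latent axes.

```python
import numpy as np

def fit_reference_space(G_ref, k):
    """Fit a k-dimensional PCA ancestry space from reference genotypes.

    G_ref: (samples x SNPs) array with genotypes coded 0/1/2.
    Returns the per-SNP mean, the SNP loadings, and the reference scores.
    """
    mean = G_ref.mean(axis=0)
    X = G_ref - mean
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = Vt[:k].T          # (SNPs x k), orthonormal columns
    scores = U[:, :k] * S[:k]    # (samples x k), mutually orthogonal axes
    return mean, loadings, scores

def project(G_target, mean, loadings):
    """Project target samples onto the reference space.

    Missing genotypes (NaN) are replaced by the reference mean,
    which is an assumption made for this illustration only.
    """
    G = np.where(np.isnan(G_target), mean, G_target)
    return (G - mean) @ loadings

# Synthetic reference panel: two populations with different allele frequencies.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.95, size=(2, 200))
labels = np.repeat([0, 1], 50)
G_ref = rng.binomial(2, freqs[labels]).astype(float)

mean, loadings, scores = fit_reference_space(G_ref, k=2)

# Orthogonality check: off-diagonal entries of the Gram matrix should vanish.
gram = scores.T @ scores
off_diag = gram - np.diag(np.diag(gram))

# Poor-quality target samples: copy a few references and mask 10% of SNPs.
G_target = G_ref[:5].copy()
G_target[:, ::10] = np.nan
target_scores = project(G_target, mean, loadings)
```

In this linear setting, the projection of a sample with masked genotypes drifts away from its true position as more SNPs are missing, which is the kind of degradation the abstract says the proposed method is designed to reduce.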
