Mantes Albert Dominguez, Montserrat Daniel Mas, Bustamante Carlos D, Giró-I-Nieto Xavier, Ioannidis Alexander G
Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States.
Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain.
Nat Comput Sci. 2023 Jul;3(7):621-629. doi: 10.1038/s43588-023-00482-7. Epub 2023 Jul 6.
Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments, with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude, surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by calculating multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.
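The modeling assumption described above, in which each genome is a fractional mixture of clusters and each cluster is a vector of variant frequencies, can be illustrated with a minimal NumPy sketch. This is not the Neural ADMIXTURE implementation: the names `Q`, `F`, `encode`, and `decode`, the toy dimensions, the simulated genotypes, and the untrained linear-plus-softmax encoder are all assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): individuals, variants, clusters.
n_samples, n_variants, n_clusters = 6, 50, 3

# Hypothetical cluster allele frequencies F (K x M), each entry in [0, 1].
F = rng.uniform(0.05, 0.95, size=(n_clusters, n_variants))

# Hypothetical fractional cluster assignments Q (N x K); each row sums to 1.
Q = rng.dirichlet(np.ones(n_clusters), size=n_samples)

# Expected allele frequency for each individual at each variant.
P = Q @ F

# Simulated diploid genotypes (0, 1, or 2 copies), rescaled to [0, 1].
X = rng.binomial(2, P) / 2.0


def softmax(z):
    """Row-wise softmax, so each output row lies on the probability simplex."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def encode(X, W, b):
    """Toy encoder: map genotypes to fractional cluster assignments."""
    return softmax(X @ W + b)


def decode(Q_hat, F):
    """Decoder: reconstruct allele frequencies as a mixture of cluster frequencies."""
    return Q_hat @ F


# Untrained, randomly initialized encoder weights (illustration only).
W = rng.normal(scale=0.1, size=(n_variants, n_clusters))
b = np.zeros(n_clusters)

Q_hat = encode(X, W, b)    # shape (N, K); rows sum to 1
X_hat = decode(Q_hat, F)   # shape (N, M); values stay within [0, 1]
```

In this sketch the encoder is a single forward pass per individual, which is why assigning clusters to new samples with a stored model scales linearly with the number of samples and does not require access to the training genotypes.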