IEEE/ACM Trans Comput Biol Bioinform. 2023 May-Jun;20(3):2078-2088. doi: 10.1109/TCBB.2022.3231466. Epub 2023 Jun 5.
Genomic selection (GS) is expected to accelerate plant and animal breeding. Over the last decade, the volume of genome-wide polymorphism data has grown, raising concerns about storage cost and computational time. Several studies have individually attempted to compress genome data and to predict phenotypes. However, existing compression models do not preserve adequate data quality after compression, and existing prediction models are time-consuming and rely on the original, uncompressed data to predict the phenotype. A combined application of compression and genomic prediction modeling using deep learning could therefore resolve these limitations. We propose a Deep Learning Compression-based Genomic Prediction (DeepCGP) model that compresses genome-wide polymorphism data and predicts phenotypes of a target trait from the compressed information. The DeepCGP model has two parts: (i) an autoencoder model based on deep neural networks that compresses genome-wide polymorphism data, and (ii) regression models based on random forests (RF), genomic best linear unbiased prediction (GBLUP), and Bayesian variable selection (BayesB) that predict phenotypes from the compressed information. The model was applied to two rice datasets containing genome-wide marker genotypes and target trait phenotypes. DeepCGP achieved up to 99% prediction accuracy for a trait after 98% compression. Of the three regression methods, BayesB required the most computational time and showed the highest accuracy; however, BayesB could be used only with compressed data. Overall, DeepCGP outperformed state-of-the-art methods in terms of both compression and prediction. Our code and data are available at https://github.com/tanzilamohita/DeepCGP.
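The two-part design described in the abstract (compress marker genotypes with an autoencoder, then regress phenotypes on the compressed codes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is available in the linked repository. Everything specific here is an assumption made for illustration: the simulated genotype data, the single-layer tanh encoder, the hand-written Adam optimizer, the function names, and the use of ridge regression as a simple stand-in for RF, GBLUP, or BayesB. The layer sizes are chosen only so that 500 markers reduce to 10 codes, which mirrors the 98% compression figure.

```python
import numpy as np


def simulate(n_lines=200, n_markers=500, n_latent=5, seed=0):
    """Simulate correlated 0/1/2 marker genotypes and a phenotype (toy data)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_lines, n_latent))          # hidden population structure
    score = z @ rng.normal(size=(n_latent, n_markers))
    geno = np.digitize(score, bins=[-1.0, 1.0]).astype(float)  # values 0, 1, 2
    pheno = z @ rng.normal(size=n_latent) + 0.1 * rng.normal(size=n_lines)
    return geno, pheno


def train_autoencoder(x, code_dim, epochs=300, lr=0.01, seed=0):
    """Fit a one-hidden-layer autoencoder (tanh encoder, linear decoder) with Adam."""
    rng = np.random.default_rng(seed)
    n_markers = x.shape[1]
    w_enc = rng.normal(scale=1.0 / np.sqrt(n_markers), size=(n_markers, code_dim))
    w_dec = rng.normal(scale=0.01, size=(code_dim, n_markers))
    params = [w_enc, w_dec]
    m = [np.zeros_like(q) for q in params]
    v = [np.zeros_like(q) for q in params]
    b1, b2, eps = 0.9, 0.999, 1e-8
    losses = []
    for t in range(1, epochs + 1):
        code = np.tanh(x @ w_enc)                     # compressed representation
        err = code @ w_dec - x                        # reconstruction error
        losses.append(float(np.mean(err ** 2)))      # loss before this update
        grad_recon = 2.0 * err / err.size
        grad_dec = code.T @ grad_recon
        grad_enc = x.T @ ((grad_recon @ w_dec.T) * (1.0 - code ** 2))
        for i, g in enumerate([grad_enc, grad_dec]):
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            params[i] -= lr * m_hat / (np.sqrt(v_hat) + eps)  # in-place update
    return w_enc, w_dec, losses


def ridge_fit(x, y, lam=1.0):
    """Ridge regression with an unpenalized intercept (stand-in predictor)."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc = x - x_mean
    beta = np.linalg.solve(xc.T @ xc + lam * np.eye(x.shape[1]), xc.T @ (y - y_mean))
    return beta, y_mean - x_mean @ beta


def run_demo():
    geno, pheno = simulate()
    x = geno - geno.mean(axis=0)                      # center each marker
    # The autoencoder is unsupervised, so it sees genotypes of all lines.
    w_enc, _, losses = train_autoencoder(x, code_dim=10)
    codes = np.tanh(x @ w_enc)
    train, test = np.arange(150), np.arange(150, 200)
    beta, intercept = ridge_fit(codes[train], pheno[train])
    pred = codes[test] @ beta + intercept
    return {
        "codes_shape": codes.shape,
        "compression": 1.0 - codes.shape[1] / geno.shape[1],
        "first_loss": losses[0],
        "last_loss": losses[-1],
        # Accuracy is reported here as the Pearson correlation on held-out lines.
        "test_correlation": float(np.corrcoef(pred, pheno[test])[0, 1]),
    }


if __name__ == "__main__":
    for key, value in run_demo().items():
        print(key, value)
```

Because the predictor only ever sees the 10-dimensional codes, its cost depends on the compressed width rather than on the original number of markers, which is the practical motivation for pairing compression with a computationally heavy regression method.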