Research Unit of Mathematical Sciences, University of Oulu, FI-90014 University of Oulu, Finland.
G3 (Bethesda). 2024 Nov 6;14(11). doi: 10.1093/g3journal/jkae216.
In genomics, the use of deep learning (DL) is growing rapidly, and DL has demonstrated its ability to uncover complex relationships in large biological and biomedical data sets. With the development of high-throughput sequencing techniques, genomic markers can now be allocated to large sections of a genome. By analyzing allele sharing between individuals, one may calculate realized genomic relationships from single-nucleotide polymorphism (SNP) data rather than relying on known pedigree relationships under a polygenic model. The traditional approaches to genome-wide prediction (GWP) of quantitative phenotypes use genomic relationships in fixed global covariance modeling, possibly with some nonlinear kernel mapping (for example, Gaussian processes). The DL approaches proposed so far for GWP, on the other hand, do not account for the non-Euclidean graph structure of relationships between individuals over several generations. In this paper, we propose one global graph convolutional network (GCN) and one local sub-sampling architecture (GCN-RS), both specifically designed to perform regression analysis based on genomic relationship information. A GCN is tailored to non-Euclidean spaces and consists of several layers of graph convolutions. Through these graph convolutional layers, the GCN maps input genomic markers to their quantitative phenotype values. The GCN-RS architecture is designed to further improve the GCN's performance by sub-sampling the graph to reduce the dimensionality of the input data. The graphs are constructed using an iterative nearest neighbor approach. Comparisons show that GCN-RS outperforms the popular Genomic Best Linear Unbiased Predictor (GBLUP) method on one simulated data set and three real data sets from wheat, mice, and pigs, with an improvement in test mean squared error of 4.4% to 49.4%. This indicates that GCN-RS is a promising tool for genomic prediction in plants and animals. Furthermore, GCN-RS is computationally efficient, making it a viable option for large-scale applications.
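The pipeline outlined in the abstract (SNP genotypes, then a genomic relationship matrix, then a nearest-neighbor graph, then graph convolutions) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the relationship matrix uses a common VanRaden-style formula, the graph is built with a plain one-pass k-nearest-neighbor rule instead of the iterative approach the abstract mentions, the layer uses the widely used symmetric normalization D^(-1/2)(A + I)D^(-1/2), and all function names, the toy genotypes, and the weights are hypothetical. It uses only the Python standard library so it stays self-contained; a real analysis would use a numerical or graph-learning library.

```python
import math

def genomic_relationship(M):
    """VanRaden-style relationship matrix (assumed formula) from a
    genotype matrix M coded 0/1/2, rows = individuals, columns = SNPs.
    Assumes at least one SNP is polymorphic, so the scaling is nonzero."""
    n, m = len(M), len(M[0])
    p = [sum(row[j] for row in M) / (2.0 * n) for j in range(m)]  # allele frequencies
    Z = [[row[j] - 2.0 * p[j] for j in range(m)] for row in M]    # centered genotypes
    scale = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    return [[sum(a * b for a, b in zip(Z[i], Z[k])) / scale for k in range(n)]
            for i in range(n)]

def knn_adjacency(G, k):
    """Symmetric 0/1 adjacency linking each individual to its k most
    related others (a simple stand-in for an iterative construction)."""
    n = len(G)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: G[i][j], reverse=True)
        for j in ranked[:k]:
            A[i][j] = A[j][i] = 1.0
    return A

def normalize_adjacency(A):
    """Return D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(A_norm, H, W, relu=True):
    """One graph convolution: aggregate neighbor features, then apply weights."""
    out = matmul(matmul(A_norm, H), W)
    if relu:
        out = [[max(0.0, v) for v in row] for row in out]
    return out

# Toy data (hypothetical): 4 individuals, 4 SNPs.
genotypes = [[0, 1, 2, 1],
             [0, 1, 2, 0],
             [2, 1, 0, 1],
             [2, 0, 0, 2]]
G = genomic_relationship(genotypes)
A_norm = normalize_adjacency(knn_adjacency(G, k=1))
weights = [[0.1, -0.2]] * 4          # 4 input features -> 2 hidden units
hidden = gcn_layer(A_norm, genotypes, weights)
print(hidden)
```

In a full model, several such layers would be stacked, with the last one producing a single value per individual, and the weights would be fitted by minimizing mean squared error on the phenotypes of the training individuals.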