Xiang Weixi, Li Zhaoxin, Sun Qixin, Chai Xiujuan, Sun Tan
Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China.
Animals (Basel). 2025 Aug 24;15(17):2485. doi: 10.3390/ani15172485.
Accurate genomic prediction of complex phenotypes is crucial for accelerating genetic progress in swine breeding. However, conventional methods such as Genomic Best Linear Unbiased Prediction (GBLUP) are limited in their ability to capture complex non-additive effects that contribute significantly to phenotypic variation, which restricts the achievable accuracy of phenotype prediction. To address this challenge, we introduce a novel framework based on a self-supervised, pre-trained encoder-only Transformer model. Its core novelty lies in tokenizing SNP sequences into non-overlapping 6-mers (sequences of 6 SNPs), enabling the model to directly learn local haplotype patterns instead of treating SNPs as independent markers. The model first undergoes self-supervised pre-training on the unlabeled version of the same SNP dataset used for subsequent fine-tuning, learning intrinsic genomic representations through a masked 6-mer prediction task. Subsequently, the pre-trained model is fine-tuned on labeled data to predict phenotypic values for specific economic traits. Experimental validation demonstrates that our proposed model consistently outperforms baseline methods, including GBLUP and a Transformer of the same architecture trained from scratch (without pre-training), in prediction accuracy across key economic traits. This outperformance suggests the model's capacity to capture non-linear genetic signals missed by linear models. This research not only contributes a new, more accurate methodology for genomic phenotype prediction but also validates the potential of self-supervised learning to decipher complex genomic patterns for direct application in breeding programs. Ultimately, this approach offers a powerful new tool to enhance the rate of genetic gain in swine production by enabling more precise selection based on predicted phenotypes.
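The 6-mer tokenization and masked-prediction setup described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the genotype encoding, vocabulary, or masking rate, so the 0/1/2 genotype coding, the base-3 token IDs, the reserved `MASK_ID`, the 15% masking probability, the `-100` ignore label, and the function names `tokenize_snps` and `mask_tokens` are all assumptions made for illustration.

```python
import random

K = 6            # SNPs per token, as stated in the abstract
N_GENOTYPES = 3  # assumption: genotypes coded 0/1/2 (allele count)
MASK_ID = N_GENOTYPES ** K  # assumption: one extra ID (729) reserved for a mask token


def tokenize_snps(genotypes, k=K):
    """Split a genotype vector into non-overlapping k-mers and map each to an integer ID.

    Trailing SNPs that do not fill a complete k-mer are dropped
    (an assumption; padding would be another option).
    """
    tokens = []
    for i in range(len(genotypes) // k):
        token_id = 0
        for g in genotypes[i * k:(i + 1) * k]:
            if g not in (0, 1, 2):
                raise ValueError(f"unexpected genotype code: {g!r}")
            token_id = token_id * N_GENOTYPES + g  # read the k-mer as a base-3 number
        tokens.append(token_id)
    return tokens


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with MASK_ID for a masked-prediction objective.

    Returns (inputs, labels). Labels hold the original ID at masked positions
    and -100 elsewhere (a common "ignore this position" convention, assumed here).
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(t)
        else:
            inputs.append(t)
            labels.append(-100)
    return inputs, labels


# Usage: 14 SNPs give two complete 6-mers; the last two SNPs are dropped.
snps = [0, 1, 2, 0, 0, 1, 2, 2, 2, 2, 2, 2, 1, 0]
tokens = tokenize_snps(snps)
print(tokens)  # [136, 728]
inputs, labels = mask_tokens(tokens, rng=random.Random(0))
print(inputs, labels)
```

Under these assumptions each token summarizes the joint genotype pattern of six adjacent SNPs, so a model trained to recover masked tokens has to use the surrounding local pattern rather than single markers in isolation.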