Rijal Krishna, Holmes Caroline M, Petti Samantha, Reddy Gautam, Desai Michael M, Mehta Pankaj
Department of Physics, Boston University, Boston, MA.
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA.
ArXiv. 2025 Apr 14:arXiv:2504.10388v1.
Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically address this problem with methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized by an additive model, in which the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle to capture more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We evaluate how well this approach predicts phenotype from genotype, using simulated data across a range of models with increasing epistatic complexity and experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model gives better out-of-sample predictions than standard methods in epistatic regimes. We also explore a more general multi-environment attention-based model that jointly analyzes genotype-phenotype maps across multiple environments, and we show that such architectures can be used for "transfer learning," that is, predicting phenotypes in novel environments with limited training data.
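To make the baseline described in the abstract concrete, the following is a minimal sketch of an additive model with optional pairwise epistatic terms, fit by ordinary least squares. It is not the authors' code and does not reproduce their attention-based model or their simulations. The function names (`simulate_phenotypes`, `r_squared`), the ±1 genotype encoding, the Gaussian effect sizes, the noise level, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_phenotypes(n_samples, n_loci, epistasis_scale, rng):
    """Draw random genotypes and phenotypes from an additive + pairwise model.

    Assumptions (illustrative only): biallelic loci encoded as -1/+1,
    Gaussian effect sizes, and a small amount of Gaussian noise.
    """
    X = rng.choice([-1.0, 1.0], size=(n_samples, n_loci))
    beta = rng.normal(size=n_loci)                  # additive effect per locus
    i, j = np.triu_indices(n_loci, k=1)             # all locus pairs with i < j
    gamma = epistasis_scale * rng.normal(size=i.size)  # pairwise epistatic effects
    pairs = X[:, i] * X[:, j]                       # product features x_i * x_j
    y = X @ beta + pairs @ gamma + 0.1 * rng.normal(size=n_samples)
    return X, pairs, y

def r_squared(f_train, y_train, f_test, y_test):
    """Fit least squares with an intercept; return out-of-sample R^2."""
    A_train = np.column_stack([np.ones(len(y_train)), f_train])
    A_test = np.column_stack([np.ones(len(y_test)), f_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    total = y_test - y_test.mean()
    return 1.0 - np.sum(resid**2) / np.sum(total**2)

rng = np.random.default_rng(0)
X, pairs, y = simulate_phenotypes(n_samples=2000, n_loci=8,
                                  epistasis_scale=1.0, rng=rng)
train, test = slice(0, 1500), slice(1500, None)

# Additive-only regression: one coefficient per locus.
r2_add = r_squared(X[train], y[train], X[test], y[test])

# Additive + pairwise regression: loci plus all pairwise products.
F = np.hstack([X, pairs])
r2_epi = r_squared(F[train], y[train], F[test], y[test])

print(f"additive-only R^2:       {r2_add:.3f}")
print(f"additive + pairwise R^2: {r2_epi:.3f}")
```

Because the simulated phenotype contains pairwise terms, the additive-only fit leaves that variance unexplained, while the pairwise fit recovers it. The number of pairwise features grows quadratically with the number of loci, and higher-order interactions grow faster still, which is one reason explicit regression of this kind becomes impractical for more complex epistasis.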