Zhou Juannan, Martí-Gómez Carlos, Petti Samantha, McCandlish David M
bioRxiv. 2025 Aug 19:2025.08.15.670613. doi: 10.1101/2025.08.15.670613.
Understanding the relationship between biological sequences, such as DNA, RNA or protein sequences, and their resulting phenotypes is one of the central goals of genetics. This task is complicated by epistasis, i.e., the context dependence of mutational effects. Advances in high-throughput phenotyping now make it possible to study these relationships at unprecedented scale, generating large datasets that measure phenotypes for tens or hundreds of thousands of sequences. However, standard regression models for analyzing such datasets often make unrealistic assumptions about the generalizability of mutational effects and epistatic coefficients across genetic backgrounds. Deep neural networks offer greater flexibility but suffer from limited interpretability and lack uncertainty quantification. Here, we introduce a family of interpretable Gaussian process models for sequence-function relationships that capture epistasis through flexible prior distributions that generalize classical theoretical models from the fitness landscape literature. In particular, these priors are parameterized by interpretable site-, allele-, and mutation-specific factors controlling the degree to which specific mutations decrease the predictability of the effects of other mutations. Using GPU acceleration to scale to large protein, RNA, and genome-wide SNP datasets, our models consistently deliver superior predictive performance while yielding interpretable parameters that both recover known features and uncover novel epistatic interactions. Overall, our methods provide new insights into the structure of the genotype-phenotype map and offer scalable, interpretable approaches for exploring complex genetic interactions across diverse biological systems.
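To make the idea of a Gaussian process prior with site-specific factors concrete, here is a minimal NumPy sketch. It is not the authors' model or code. The product-form kernel, the per-site factor `rho`, and the function names `site_product_kernel` and `gp_posterior` are all assumptions chosen for illustration. In this toy kernel, a mismatch at site `l` multiplies the prior correlation between two sequences by `rho[l]`, so a small `rho[l]` means that mutations at that site make phenotypes in the new background harder to predict from the old one.

```python
import numpy as np

def site_product_kernel(X, Y, rho):
    """Illustrative kernel (an assumption, not the paper's):
    k(x, y) = product over sites l of (1 if x_l == y_l else rho[l]).

    X: (n, L) integer-encoded sequences, Y: (m, L), rho: (L,) with values in [0, 1].
    """
    mismatch = X[:, None, :] != Y[None, :, :]            # (n, m, L) boolean
    return np.where(mismatch, rho, 1.0).prod(axis=-1)    # (n, m)

def gp_posterior(X_train, y_train, X_test, rho, noise_var=1e-2):
    """Standard GP regression posterior mean and variance under the toy kernel."""
    n = len(X_train)
    K = site_product_kernel(X_train, X_train, rho) + noise_var * np.eye(n)
    K_s = site_product_kernel(X_test, X_train, rho)      # (m, n)

    # Cholesky-based solve for numerical stability.
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = K_s @ alpha

    # Prior variance is k(x, x) = 1 for every sequence under this kernel.
    v = np.linalg.solve(chol, K_s.T)                     # (n, m)
    var = 1.0 - np.sum(v**2, axis=0)
    return mean, var

# Toy usage: all 8 binary sequences of length 3, hypothetical phenotype values.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
y = X.sum(axis=1).astype(float)
rho = np.array([0.9, 0.5, 0.1])   # hypothetical per-site factors
mean, var = gp_posterior(X[:6], y[:6], X, rho, noise_var=1e-6)
```

The posterior variance gives the per-sequence uncertainty estimate that the abstract contrasts with deep neural networks. In practice the factors in `rho` would be learned from data, for example by maximizing the marginal likelihood, rather than fixed by hand as they are here.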