Learning sequence-function relationships with scalable, interpretable Gaussian processes.

Authors

Zhou Juannan, Martí-Gómez Carlos, Petti Samantha, McCandlish David M

Publication information

bioRxiv. 2025 Aug 19:2025.08.15.670613. doi: 10.1101/2025.08.15.670613.

DOI:10.1101/2025.08.15.670613

PMID:40894759

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12393447/

Abstract

Understanding the relationship between biological sequences, such as DNA, RNA or protein sequences, and their resulting phenotypes is one of the central goals of genetics. This task is complicated by epistasis, i.e., the context dependence of mutational effects. Advances in high-throughput phenotyping now make it possible to study these relationships at unprecedented scale, generating large datasets that measure phenotypes for tens or hundreds of thousands of sequences. However, standard regression models for analyzing such datasets often make unrealistic assumptions about the generalizability of mutational effects and epistatic coefficients across genetic backgrounds. Deep neural networks offer greater flexibility but suffer from limited interpretability and lack uncertainty quantification. Here, we introduce a family of interpretable Gaussian process models for sequence-function relationships that capture epistasis through flexible prior distributions that generalize classical theoretical models from the fitness landscape literature. In particular, these priors are parameterized by interpretable site-, allele-, and mutation-specific factors controlling the degree to which specific mutations decrease the predictability of the effects of other mutations. Using GPU acceleration to scale to large protein, RNA, and genome-wide SNP datasets, our models consistently deliver superior predictive performance while yielding interpretable parameters that both recover known features and uncover novel epistatic interactions. Overall, our methods provide new insights into the structure of the genotype-phenotype map and offer scalable, interpretable approaches for exploring complex genetic interactions across diverse biological systems.
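
As a rough illustration of the general idea (not the authors' model or code), the sketch below runs exact Gaussian process regression over short sequences with a product kernel in which each site has its own decay factor. The kernel form, the names `encode`, `site_product_kernel` and `gp_predict`, the `rho` values, and the toy data are all assumptions made for illustration. The priors described in the abstract are richer (allele- and mutation-specific factors) and are fitted with GPU acceleration, which this CPU-only NumPy example does not attempt.

```python
import numpy as np


def encode(seqs):
    """Turn equal-length strings into an (N, L) array of character codes."""
    return np.array([[ord(c) for c in s] for s in seqs])


def site_product_kernel(A, B, rho):
    """Hypothetical kernel: k(x, y) = product of rho[l] over sites l where x and y differ.

    Each rho[l] must lie in (0, 1]; a smaller value means a mutation at
    site l decorrelates the two sequences more strongly.
    """
    mismatch = (A[:, None, :] != B[None, :, :]).astype(float)  # shape (N, M, L)
    return np.exp(mismatch @ np.log(rho))                      # shape (N, M)


def gp_predict(train_seqs, y, test_seqs, rho, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test sequences."""
    A, B = encode(train_seqs), encode(test_seqs)
    y = np.asarray(y, dtype=float)
    K = site_product_kernel(A, A, rho) + noise_var * np.eye(len(y))
    Ks = site_product_kernel(B, A, rho)
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    mean = Ks @ alpha
    v = np.linalg.solve(chol, Ks.T)
    var = 1.0 - np.sum(v ** 2, axis=0)  # prior variance k(x, x) is 1
    return mean, var


# Toy data (made up): five measured sequences, three query sequences.
train = ["AAA", "AAC", "ACA", "CAA", "CCC"]
y = [0.0, 1.0, 0.2, -0.5, 2.0]
rho = np.array([0.9, 0.5, 0.2])  # assumed per-site decay factors
mean, var = gp_predict(train, y, ["AAA", "CCA", "GGG"], rho)
print(mean, var)
```

In this toy setup, a site with a small `rho` value makes a mutation there sharply reduce the covariance between two sequences, so measurements on one background say little about the other. A value near 1 leaves predictions largely transferable across that mutation. The posterior variance gives the uncertainty estimate that the abstract notes is lacking in deep neural networks.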

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2fa1/12393447/3a96585a3c07/nihpp-2025.08.15.670613v1-f0001.jpg
