On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing.

Authors

Petti Samantha, Martí-Gómez Carlos, Kinney Justin B, Zhou Juannan, McCandlish David M

Affiliations

Department of Mathematics, Tufts University, Medford, MA, 02155.

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724.

Publication information

bioRxiv. 2025 Jul 11:2025.04.26.650699. doi: 10.1101/2025.04.26.650699.

DOI:10.1101/2025.04.26.650699

PMID:40672195

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12265701/

Abstract

Mappings from biological sequences (DNA, RNA, protein) to quantitative measures of sequence functionality play an important role in contemporary biology. We are interested in the related tasks of (i) inferring predictive sequence-to-function maps and (ii) decomposing sequence-function maps to elucidate the contributions of individual subsequences. Because each sequence-function map can be written as a weighted sum over subsequences in multiple ways, meaningfully interpreting these weights requires "gauge-fixing," i.e., defining a unique representation for each map. Recent work has established that most existing gauge-fixed representations arise as the unique solutions to L2-regularized regression in an overparameterized "weight space" where the choice of regularizer defines the gauge. Here, we establish the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in "function space," i.e., the space of all real-valued functions on a finite set of sequences. We disentangle how weight space regularizers both impose an implicit prior on the learned function and restrict the optimal weights to a particular gauge. We also show how to construct regularizers that correspond to arbitrary explicit Gaussian process priors combined with a wide variety of gauges. Next, we derive the distribution of gauge-fixed weights implied by the Gaussian process posterior and demonstrate that even for long sequences this distribution can be efficiently computed for product-kernel priors using a kernel trick. Finally, we characterize the implicit function space priors associated with the most common weight space regularizers. Overall, our framework unifies and extends our ability to infer and interpret sequence-function relationships.
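
As an illustration of the ideas in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a toy two-letter alphabet, sequences of length 3, an additive (constant plus one-hot) model, an identity-matrix L2 regularizer with strength `lam`, and unit noise variance; all names (`additive_features`, `Phi`, `lam`, `train`) are hypothetical. It shows three things: two different weight vectors can encode the same function (gauge freedom), the L2-regularized solution has no component along that gauge direction, and the same predictions are obtained from a Gaussian process in function space with kernel K = Phi Phi^T / lam.

```python
import itertools
import numpy as np

ALPHABET = "AC"  # toy alphabet (assumption, not from the paper)
LENGTH = 3       # toy sequence length (assumption)

def additive_features(seq):
    """Constant feature followed by one-hot features for each (position, character)."""
    feats = [1.0]
    for ch in seq:
        feats.extend(1.0 if ch == a else 0.0 for a in ALPHABET)
    return feats

# Feature matrix over the full sequence space: 8 sequences x 7 weights.
seqs = ["".join(s) for s in itertools.product(ALPHABET, repeat=LENGTH)]
Phi = np.array([additive_features(s) for s in seqs])

rng = np.random.default_rng(0)
w = rng.normal(size=Phi.shape[1])

# Gauge freedom: raise the constant weight by 1 and lower every weight at
# position 0 by 1. Each sequence has exactly one character at position 0,
# so the function values Phi @ w are unchanged.
g = np.zeros(Phi.shape[1])
g[0] = 1.0
g[1:1 + len(ALPHABET)] = -1.0
same_function = np.allclose(Phi @ w, Phi @ (w + g))

# Weight space: L2-regularized (ridge) regression on a few measured sequences.
train = [0, 3, 5, 6]
X = Phi[train]
y = rng.normal(size=len(train))  # synthetic measurements, for illustration only
lam = 0.5
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
f_weight_space = Phi @ w_ridge

# The ridge solution is unique and has zero component along the gauge direction.
gauge_component = g @ w_ridge

# Function space: GP posterior mean with prior covariance K = Phi Phi^T / lam
# and unit noise variance (assumption matching the ridge objective above).
K = Phi @ Phi.T / lam
K_tt = K[np.ix_(train, train)]
f_function_space = K[:, train] @ np.linalg.solve(K_tt + np.eye(len(train)), y)

print("same function under gauge shift:", same_function)
print("ridge and GP predictions agree:", np.allclose(f_weight_space, f_function_space))
print("gauge component of ridge weights:", gauge_component)
```

In this sketch the regularizer plays both roles described in the abstract: it fixes the prior over functions through K, and it selects which of the equivalent weight vectors is returned. The paper treats more general regularizers and gauges than the identity-matrix case used here.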
