In-Context Learning can distort the relationship between sequence likelihoods and biological fitness.

Author information

Kantroo Pranav, Wagner Günter P, Machta Benjamin B

Affiliations

Computational Biology and Bioinformatics Program, Yale University, New Haven, CT-06520, USA.

Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA.

Publication information

ArXiv. 2025 Apr 23:arXiv:2504.17068v1.

Abstract

Language models have emerged as powerful predictors of the viability of biological sequences. During training these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reversed complement motifs in RNA sequences that fold into hairpin structures.
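The abstract describes the likelihood score and the look-up behavior only in words. As a toy illustration, and not the authors' models or code, the sketch below scores a sequence by masking one position at a time and summing the log-probability assigned to the true residue. This pseudo-log-likelihood is an assumed scoring rule; the abstract does not state the exact formula. The predictor `toy_masked_predictor`, its parameters `k` and `copy_weight`, and the example sequences are all hypothetical. The predictor starts from a uniform prior and shifts probability toward whatever residue followed another copy of the preceding `k` residues, mimicking a retrieval that overrides the prior.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # placeholder symbol for the masked position (assumption)


def toy_masked_predictor(masked_seq, pos, k=3, copy_weight=0.9):
    """Hypothetical stand-in for a masked language model.

    Returns a probability distribution over amino acids for position `pos`.
    The prior is uniform. If the k residues before the mask also occur
    elsewhere in the sequence, probability mass moves toward the residue
    that followed that other copy (a crude look-up operation).
    """
    probs = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    if pos < k:
        return probs  # not enough left context to look anything up
    context = masked_seq[pos - k:pos]
    votes = {}
    for start in range(len(masked_seq) - k):
        follower_idx = start + k
        if follower_idx == pos:
            continue  # skip the masked position itself
        if masked_seq[start:follower_idx] == context:
            follower = masked_seq[follower_idx]
            if follower in probs:
                votes[follower] = votes.get(follower, 0) + 1
    if votes:
        total = sum(votes.values())
        for aa in probs:
            copied = votes.get(aa, 0) / total
            probs[aa] = (1 - copy_weight) * probs[aa] + copy_weight * copied
    return probs


def pseudo_log_likelihood(seq, predictor=toy_masked_predictor):
    """Mask each position in turn and sum log P(true residue | rest)."""
    total = 0.0
    for pos, true_aa in enumerate(seq):
        masked = seq[:pos] + MASK + seq[pos + 1:]
        total += math.log(predictor(masked, pos)[true_aa])
    return total


motif = "MKTAYIAKQR"               # arbitrary example motif
repeated = motif + motif           # sequence with a perfectly repeated motif
control = motif + "QLDEGVWSHN"     # same length, no repeated 3-mers

print("repeated:", pseudo_log_likelihood(repeated))
print("control: ", pseudo_log_likelihood(control))
```

In this toy setting the repeated sequence scores higher than the control only because the predictor can copy from the other motif, not because the sequence is any more plausible under the prior. That is the kind of distortion the abstract describes, though real protein language models are far more complex.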

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed9b/12045397/725e0c9f2ab9/nihpp-2504.17068v1-f0001.jpg
