Pratap Jayanth S, Krueger Ryan K, Rivas Elena
Department of Molecular and Cellular Biology, Cambridge, MA 02138, USA.
School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.
bioRxiv. 2025 Aug 2:2025.07.31.668042. doi: 10.1101/2025.07.31.668042.
Amid the rapid development of RNA large language models with millions of parameters, we asked what is the minimum required to rediscover the rules of canonical RNA base pairing, mainly the Watson-Crick-Franklin A:U and G:C pairs and the wobble G:U pair (the secondary structure). We conclude that it does not require much at all: it does not require knowing secondary structures, it does not require aligning the sequences, and it does not require many parameters. We selected a probabilistic model of palindromes (a stochastic context-free grammar, or SCFG) with a total of just 21 parameters. Using standard deep learning techniques, we estimate its parameters by implementing the generative process in an automatic differentiation (autodiff) framework and applying stochastic gradient descent (SGD). We define and minimize a loss function that does not use any structural or alignment information; the sole inputs are RNA sequences. Trained on as few as fifty RNA sequences, the rules of RNA base pairing emerge after only a few iterations of SGD. When optimizing on sequences corresponding to structured RNAs, SGD also yields the rules by which RNA base pairs aggregate into helices. Trained on shuffled sequences, the system optimizes by avoiding base pairing altogether. Trained on messenger RNAs, it reveals interactions that differ from those of structural RNAs and are specific to each mRNA. Our results show that the emergence of canonical base pairing can be attributed to sequence-level signals that are robust and detectable even without labeled structures or alignments, and with very few parameters. Autodiff algorithms for probabilistic models, such as, but not restricted to, SCFGs, have significant potential because they allow these models to be incorporated into end-to-end RNA deep learning methods for discerning transcripts of different functionalities.
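The training setup described in the abstract can be illustrated with a minimal JAX sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy grammar S → a S b S | a S | ε (16 pair weights, 4 single-nucleotide weights and 1 end weight, so 21 logits in total), the loss (mean per-nucleotide negative log-likelihood computed with the inside algorithm), the made-up hairpin-like sequences, and the names `inside_prob`, `loss` and `train`. It uses full-batch gradient descent; sampling a minibatch of sequences at each step would turn it into SGD.

```python
import jax
import jax.numpy as jnp

NUC = {"A": 0, "C": 1, "G": 2, "U": 3}

# Made-up hairpin-like toy sequences (illustrative only, not data from the paper).
TOY_SEQS = ["GGGCGAAAGCCC", "AUGCUUCGGCAU", "CCGGAAUACCGG"]


def encode(seq):
    """Map an RNA string to a list of integer nucleotide codes."""
    return [NUC[c] for c in seq]


def split_params(logits):
    """Turn 21 logits into rule probabilities that sum to 1.

    Assumed toy grammar: S -> a S b S (16 rules) | a S (4 rules) | end (1 rule).
    """
    p = jax.nn.softmax(logits)
    return p[:16].reshape(4, 4), p[16:20], p[20]


def inside_prob(logits, x):
    """Inside algorithm: probability that S generates x, summed over all parses."""
    pair, single, end = split_params(logits)
    # Pull the probabilities out as scalars once, so the loops below only multiply and add.
    pair_s = [[pair[a, b] for b in range(4)] for a in range(4)]
    single_s = [single[a] for a in range(4)]
    n = len(x)
    # inside[i][j] = P(S generates x[i:j]) for the half-open span [i, j).
    inside = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        inside[i][i] = end  # an empty span can only come from the end rule
    for d in range(1, n + 1):  # span length
        for i in range(n - d + 1):
            j = i + d
            # x[i] is emitted unpaired.
            total = single_s[x[i]] * inside[i + 1][j]
            # x[i] pairs with x[k]; S generates the enclosed span and the remainder.
            for k in range(i + 1, j):
                total = total + pair_s[x[i]][x[k]] * inside[i + 1][k] * inside[k + 1][j]
            inside[i][j] = total
    return inside[0][n]


def loss(logits, seqs):
    """Mean per-nucleotide negative log-likelihood; uses sequences only."""
    nll = [-jnp.log(inside_prob(logits, encode(s))) / len(s) for s in seqs]
    return sum(nll) / len(nll)


def train(seqs, steps=30, lr=0.3):
    """Full-batch gradient descent on the 21 logits, with gradients from autodiff."""
    value_and_grad = jax.jit(jax.value_and_grad(lambda w: loss(w, seqs)))
    logits = jnp.zeros(21)  # uniform rule probabilities at the start
    for _ in range(steps):
        _, grad = value_and_grad(logits)
        logits = logits - lr * grad
    return logits


if __name__ == "__main__":
    trained = train(TOY_SEQS)
    pair, single, end = split_params(trained)
    print("loss before:", float(loss(jnp.zeros(21), TOY_SEQS)))
    print("loss after: ", float(loss(trained, TOY_SEQS)))
    print("pair probabilities (rows/cols A, C, G, U):")
    print(pair)
```

The inside table is built with plain Python loops, which keeps the recursion readable but unrolls into a large computation graph; this is only practical for short sequences. A version intended for real transcripts would likely work in log space and vectorize over spans. With three toy sequences, the learned pair matrix should not be read as evidence for or against the paper's findings.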