Zablocki Luciano I, Bugnon Leandro A, Gerard Matias, Di Persia Leandro, Stegmayer Georgina, Milone Diego H
Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL/CONICET, Ruta Nacional Nº 168, km 472.4, Santa Fe (3000), Argentina.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf137.
In recent years, inspired by the success of large language models (LLMs) for DNA and proteins, several LLMs for RNA have also been developed. These models take massive RNA datasets as input and learn, in a self-supervised way, to represent each RNA base with a semantically rich numerical vector, under the hypothesis that high-quality RNA representations can enhance data-costly downstream tasks, such as the fundamental problem of RNA secondary structure prediction. However, existing RNA-LLMs have not been evaluated for this task in a unified experimental setup. Since they are pretrained models, assessing how well they generalize to new structures is crucial, yet this has been only partially addressed in the literature. In this work, we present a comprehensive experimental and comparative analysis of recently proposed pretrained RNA-LLMs, evaluating their representations for the secondary structure prediction task with a common deep learning architecture. The RNA-LLMs were assessed on benchmark datasets of increasing generalization difficulty. Results showed that two LLMs clearly outperform the other models and revealed significant challenges for generalization in low-homology scenarios. Source code, the curated benchmark datasets of increasing complexity, and the unified experimental setup are available in the repository: https://github.com/sinc-lab/rna-llm-folding/.
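To make the evaluation setup concrete, below is a minimal sketch of how per-base embeddings from a pretrained RNA-LLM can feed a common downstream head for secondary structure prediction. This is an illustrative assumption, not the paper's actual architecture: the class name PairPredictor, the hidden sizes, and the embedding dimension are all hypothetical, and the sketch assumes PyTorch.

```python
import torch
import torch.nn as nn

class PairPredictor(nn.Module):
    """Toy downstream head (hypothetical, not the paper's model):
    maps per-base RNA-LLM embeddings of shape (L, d) to an L x L
    base-pairing score matrix."""

    def __init__(self, emb_dim: int, hidden: int = 64):
        super().__init__()
        self.proj = nn.Linear(emb_dim, hidden)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * hidden, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (L, d) per-base representations from a pretrained RNA-LLM
        h = self.proj(emb)                        # (L, hidden)
        L = h.shape[0]
        # Outer concatenation: position pair (i, j) gets [h_i ; h_j]
        hi = h.unsqueeze(1).expand(L, L, -1)
        hj = h.unsqueeze(0).expand(L, L, -1)
        pair = torch.cat([hi, hj], dim=-1)        # (L, L, 2*hidden)
        logits = self.conv(pair.permute(2, 0, 1).unsqueeze(0))
        scores = logits.squeeze(0).squeeze(0)     # (L, L)
        return 0.5 * (scores + scores.T)          # symmetrize: pairing is mutual

# Usage with a dummy embedding (d = 640 is an assumed dimension,
# standing in for whatever a given RNA-LLM produces per base):
emb = torch.randn(80, 640)                        # one 80-nt sequence
model = PairPredictor(emb_dim=640)
pair_probs = torch.sigmoid(model(emb))            # (80, 80) pairing probabilities
```

Keeping the downstream head fixed like this, and swapping only the embedding source, is what allows different RNA-LLMs to be compared on equal footing.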