Li Sizhen, Noroozizadeh Shahriar, Moayedpour Saeed, Kogler-Anele Lorenzo, Xue Zexin, Zheng Dinghai, Montoya Fernando Ulloa, Agarwal Vikram, Bar-Joseph Ziv, Jager Sven
Digital R&D, Sanofi, Cambridge, MA 02141, United States.
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States.
Nucleic Acids Res. 2025 Jan 24;53(3). doi: 10.1093/nar/gkaf044.
The success of the SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) messenger RNA (mRNA) vaccines has led to increased interest in the design and use of mRNA for vaccines and therapeutics. Still, selecting the most appropriate mRNA sequence for a given protein remains a challenge. Several recent studies have shown that the specific mRNA sequence can have a significant impact on translation efficiency, half-life, degradation rate, and other properties that play a major role in determining vaccine efficiency. To enable the selection of the most appropriate sequence, we developed mRNA-LM, an integrated small language model for modeling the entire mRNA sequence. mRNA-LM uses the contrastive language-image pretraining (CLIP) integration technique to combine three separate language models for the different mRNA segments. We trained mRNA-LM on millions of diverse mRNA sequences from several different species. Without supervision, the model learned biologically meaningful patterns related to evolution and host-pathogen interactions. Fine-tuning mRNA-LM allowed us to apply it to several mRNA property prediction tasks. As we show, using the full-length integrated model led to accurate predictions, improving on prior methods proposed for these tasks.
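To make the CLIP-style integration step more concrete, the following is a minimal sketch of a symmetric contrastive loss applied pairwise across three per-segment embeddings. It is not the authors' implementation: the abstract does not say how the three models are paired or weighted, so the pairwise averaging, the temperature value, the embedding sizes, and all names (`clip_pair_loss`, `three_segment_loss`, `seg_a`, `seg_b`, `seg_c`) are assumptions made for illustration, and random vectors stand in for real language model outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_pair_loss(a, b, temperature=0.07):
    """Symmetric CLIP-style contrastive loss for two batches of embeddings.

    Row i of `a` and row i of `b` are assumed to come from the same mRNA;
    every other row in the batch serves as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def three_segment_loss(seg_a, seg_b, seg_c, temperature=0.07):
    """Average the pairwise losses over three segment embeddings.

    Assumption: all three pairs are weighted equally; the abstract does not
    specify how the three models are combined.
    """
    pairs = [(seg_a, seg_b), (seg_a, seg_c), (seg_b, seg_c)]
    return sum(clip_pair_loss(x, y, temperature) for x, y in pairs) / len(pairs)


# Toy data: 8 "transcripts", 16-dimensional embeddings per segment.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))

# Aligned case: the three segment embeddings of each transcript nearly agree.
aligned_loss = three_segment_loss(
    shared,
    shared + 0.01 * rng.normal(size=(8, 16)),
    shared + 0.01 * rng.normal(size=(8, 16)),
)

# Unaligned case: the three segment embeddings are unrelated.
random_loss = three_segment_loss(
    rng.normal(size=(8, 16)),
    rng.normal(size=(8, 16)),
    rng.normal(size=(8, 16)),
)

print(f"aligned: {aligned_loss:.4f}  unaligned: {random_loss:.4f}")
```

In this toy setup the aligned batch yields a much lower loss than the unrelated one. Minimizing such a loss during training would pull the embeddings of segments from the same transcript toward each other, which is the property a contrastive objective is meant to enforce.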