Orthrus: Towards Evolutionary and Functional RNA Foundation Models.

Authors

Fradkin Philip, Shi Ruian, Isaev Keren, Frey Brendan J, Morris Quaid, Lee Leo J, Wang Bo

Affiliations

Vector Institute, Ontario, Canada.

Computer Science, University of Toronto, Ontario, Canada.

Publication information

bioRxiv. 2024 Dec 10:2024.10.10.617658. doi: 10.1101/2024.10.10.617658.

Abstract

In the face of rapidly accumulating genomic data, our ability to accurately predict key mature RNA properties that underlie transcript function and regulation remains limited. Pre-trained genomic foundation models offer an avenue to adapt learned RNA representations to biological prediction tasks. However, existing genomic foundation models are trained using strategies borrowed from textual or visual domains that do not leverage biological domain knowledge. Here, we introduce Orthrus, a Mamba-based mature RNA foundation model pre-trained using a novel self-supervised contrastive learning objective with biological augmentations. Orthrus is trained by maximizing embedding similarity between curated pairs of RNA transcripts, where pairs are formed from splice isoforms of 10 model organisms and transcripts from orthologous genes in 400+ mammalian species from the Zoonomia Project. This training objective results in a latent representation that clusters RNA sequences with functional and evolutionary similarities. We find that the generalized mature RNA isoform representations learned by Orthrus significantly outperform existing genomic foundation models on five mRNA property prediction tasks, and require only a fraction of the fine-tuning data to do so. Finally, we show that Orthrus is capable of capturing the divergent biological functions of individual transcript isoforms.
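The idea of "maximizing embedding similarity between curated pairs" can be illustrated with a generic contrastive loss. The sketch below is an assumption for illustration only: the abstract does not state the exact loss, so an InfoNCE-style objective is swapped in, and the function name `info_nce_loss`, the `temperature` value, and the random toy embeddings are all hypothetical. Row i of `anchors` and row i of `positives` stand for the embeddings of one transcript pair (for example, two splice isoforms or two orthologous transcripts); all other rows in the batch act as negatives.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact objective).

    anchors, positives: arrays of shape (n, d); row i of each forms a positive pair.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # (n, n) similarity matrix; the diagonal holds the positive pairs.
    logits = (a @ p.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Minimizing this raises similarity of matched pairs relative to the rest.
    return float(-np.mean(np.diag(log_probs)))

# Toy data standing in for transcript embeddings (hypothetical values).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
paired = emb + 0.01 * rng.normal(size=(8, 16))   # near-identical partners
mismatched = np.roll(paired, shift=1, axis=0)    # partners assigned to the wrong rows

loss_paired = info_nce_loss(emb, paired)
loss_mismatched = info_nce_loss(emb, mismatched)
```

In this toy setting the loss is lower when each embedding is matched with its true partner than when partners are shuffled, which is the behavior a contrastive objective of this kind rewards.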

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b375/11639227/094df557da3f/nihpp-2024.10.10.617658v2-f0005.jpg
