RPIPLM: Prediction of ncRNA-protein interaction by post-training a dual-tower pretrained biological model with supervised contrastive learning.

Authors

Liu Yiwei, Bao Ting, Yin Peng, Wang Shumin, Wang Yanbin

Affiliations

Defence Industry Secrecy Examination and Certification Center, Beijing, China.

National Key Laboratory of Science and Technology on Information System Security, Beijing, China.

Publication information

PLoS One. 2025 Aug 14;20(8):e0329174. doi: 10.1371/journal.pone.0329174. eCollection 2025.

DOI:10.1371/journal.pone.0329174

PMID:40811705

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12352837/

Abstract

The emergence of biological pre-trained models has profoundly impacted biological research and driven remarkable advances in life sciences and medicine. However, current biological pre-trained language models share a shortcoming: they fail to capture the intricacies of molecular interactions such as ncRNA-protein interactions. In this context, we introduce RPIPLM, a two-tower computational framework that offers a new paradigm for predicting ncRNA-protein interactions. At its core, RPIPLM uses a pre-trained RNA language model and a pre-trained protein language model to process ncRNA and protein sequences, thereby transferring the general knowledge gained from self-supervised learning on vast data to ncRNA-protein interaction tasks. To learn the intricate interaction patterns between RNA and protein embeddings across diverse scales, we apply a fusion of scaled dot-product self-attention and multi-scale convolution operations to the output of the dual-tower architecture, capturing both global and local information. Furthermore, we introduce supervised contrastive learning into the training of RPIPLM, so that the learned representations separate interacting from non-interacting samples and the model captures discriminative information. Through extensive experiments and an interpretability study, we demonstrate the effectiveness of RPIPLM and its superiority over other methods, establishing new state-of-the-art performance. RPIPLM is a powerful and scalable computational framework with the potential to unlock insights from vast biological data and thereby accelerate the discovery of molecular interactions.
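The supervised contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the widely used SupCon formulation: cosine similarity, a temperature, and positives defined as other samples in the batch that share the anchor's label. The abstract does not state the exact loss, the temperature, or which representations it is applied to, so the function name `supervised_contrastive_loss`, its parameters, and the toy inputs are hypothetical and are not taken from RPIPLM.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Batch supervised contrastive loss (generic SupCon form, an assumption).

    embeddings: (n, d) array, assumed to hold one fused ncRNA-protein pair
                representation per row (rows must be non-zero vectors).
    labels:     (n,) array, e.g. 1 = interacting, 0 = non-interacting.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    n = z.shape[0]

    # L2-normalise rows so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Every other sample in the batch is a candidate; exclude the anchor itself.
    not_self = ~np.eye(n, dtype=bool)

    # Subtract the row maximum for numerical stability (cancels in log-softmax).
    sim = sim - sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Positives: same label as the anchor, excluding the anchor.
    positives = (y[:, None] == y[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive partner in this batch

    # Average log-probability of positives per anchor, then over anchors.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = pos_log_prob[has_pos] / pos_count[has_pos]
    return float(-per_anchor.mean())


if __name__ == "__main__":
    # Toy batch: two tight clusters in 2-D (illustrative data only).
    toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
    print(supervised_contrastive_loss(toy, [1, 1, 0, 0]))  # labels match clusters
    print(supervised_contrastive_loss(toy, [1, 0, 1, 0]))  # labels cut across clusters
```

The loss is small when samples with the same label sit close together and large when they are scattered, which is the sense in which such an objective pushes interacting and non-interacting pairs apart in representation space. In practice a term of this kind is often combined with a standard classification loss; whether RPIPLM does so is not stated in the abstract.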
