College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.
International Medical Center, Shenzhen University General Hospital, SZU 518055, China.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae640.
Recent advances in high-throughput sequencing have sharply increased interest in non-coding RNA (ncRNA) research within the life sciences. Even so, the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs in the same family typically share similar functions, so identifying the family of an ncRNA is a useful step toward understanding its role. There are two primary approaches to predicting ncRNA families: biological and computational. Traditional biological methods are unsuitable for large-scale prediction because they demand substantial human effort and resources. Most existing computational methods, meanwhile, rely either solely on ncRNA sequence data or exclusively on the secondary structure of ncRNA molecules. As a result, they fail to exploit the rich multi-modal information that ncRNAs offer and cannot learn more comprehensive, in-depth feature representations.
To tackle these problems, we propose MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first use a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. We then adopt a contrastive learning framework with an attention mechanism to fuse in the secondary structure information obtained by graph neural networks. Experimental comparisons with several competitive baselines show that, by integrating sequence and structural information, MM-ncRNAFP learns more comprehensive representations of ncRNA features and significantly improves ncRNA family prediction. Ablation experiments and qualitative analyses verify the effectiveness of each component of the model. Moreover, because the model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks.
MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.
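As a rough illustration of the two ingredients the abstract names, the following Python sketch (i) turns a secondary structure into graph edges of the kind a graph neural network could consume and (ii) computes a contrastive loss that pulls each sequence embedding toward the structure embedding of the same ncRNA. It is a minimal sketch under stated assumptions, not the authors' implementation: the dot-bracket input format, the InfoNCE-style objective, the temperature value, and the function names `dot_bracket_to_edges` and `info_nce` are all illustrative choices that the abstract does not specify. The language-model encoder, the graph neural network, and the attention-based fusion are omitted.

```python
import math


def dot_bracket_to_edges(structure):
    """Return undirected edges (i, j) for a dot-bracket secondary structure.

    Assumption: the structure is given in dot-bracket notation, where
    '(' and ')' mark paired bases and '.' marks unpaired bases.
    """
    # Backbone edges connect consecutive nucleotides.
    edges = [(i, i + 1) for i in range(len(structure) - 1)]
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            # Base-pair edge between the matching '(' and this ')'.
            edges.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return edges


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(seq_embs, struct_embs, temperature=0.1):
    """Mean InfoNCE loss in the sequence -> structure direction.

    Row i of seq_embs and row i of struct_embs are assumed to describe
    the same ncRNA (the positive pair); all other rows act as negatives.
    """
    seq = [_normalize(v) for v in seq_embs]
    struct = [_normalize(v) for v in struct_embs]
    total = 0.0
    for i, s in enumerate(seq):
        # Cosine similarity to every structure embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in struct]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(seq)


if __name__ == "__main__":
    print(dot_bracket_to_edges("((..))"))
    toy = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-D embeddings
    print(info_nce(toy, toy))
```

In this toy setting, the loss is small when matching rows are similar and grows when the pairing is shuffled, which is the behavior a contrastive objective relies on to align the two modalities.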