Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China.
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China.
Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btae004.
In recent years, circular RNAs (circRNAs), a class of RNA with a closed-loop structure, have attracted widespread attention because of their physiological significance (they can bind proteins directly), which has led to the development of numerous algorithms for identifying protein-binding sites. However, these methods are supervised and need most of the samples to be labeled for training before they perform well. Such labels are difficult to obtain because they require a large number of biological experiments.
To reduce the number of labels needed for training in the circRNA-binding site prediction task, this article proposes CircSI-SSL, a binding site identification algorithm based on self-supervised learning. To the best of our knowledge, this is the first such approach in this research field. Specifically, CircSI-SSL first combines multiple feature encoding schemes and uses RNA_Transformer for cross-view sequence prediction (the self-supervised task) to learn mutual information from the multi-view data; it is then fine-tuned with only a few labeled samples. Comprehensive experiments on six widely used circRNA datasets indicate that CircSI-SSL performs well in comparison with previous algorithms, even in the extreme case where the ratio of training data to test data is 1:9. In addition, a transfer experiment on six linRNA datasets, run without modifying the network or adjusting hyperparameters, shows that CircSI-SSL extends well to other data. In summary, the proposed self-supervised prediction algorithm is expected to replace previous supervised algorithms and to have broader application value.
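The data side of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the feature encoding schemes, so the one-hot and k-mer views below are assumptions, as are all function names. The sketch only shows how two views of one sequence yield label-free (input, target) pairs for a cross-view prediction task, and how a 1:9 training-to-test split can be drawn. The RNA_Transformer model itself is not reproduced.

```python
import random

NUCLEOTIDES = "ACGU"


def one_hot_view(seq):
    """Assumed view 1: a 4-dimensional indicator vector per nucleotide."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq]


def kmer_view(seq, k=3):
    """Assumed view 2: base-4 integer id of the k-mer starting at each position."""
    ids = []
    for i in range(len(seq) - k + 1):
        code = 0
        for base in seq[i:i + k]:
            code = code * 4 + NUCLEOTIDES.index(base)
        ids.append(code)
    return ids


def cross_view_pairs(seq, k=3):
    """Build label-free pretext pairs: each view is the input for predicting the other."""
    v1, v2 = one_hot_view(seq), kmer_view(seq, k)
    return [(v1, v2), (v2, v1)]


def split_1_to_9(samples, seed=0):
    """Shuffle, then keep 10% for labeled fine-tuning and 90% for testing."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = max(1, len(shuffled) // 10)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    pairs = cross_view_pairs("ACGUACGU")
    train, test = split_1_to_9(list(range(100)))
    print(len(pairs), len(train), len(test))
```

In this sketch the pretext pairs need no binding-site labels at all, which is the point of the self-supervised stage; labels enter only through the small fine-tuning split.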
The source code and data are available at https://github.com/cc646201081/CircSI-SSL.