Wang Xinyi, Zhang Mingyang, Long Chunlin, Yao Lin, Zhu Min
IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):1469-1479. doi: 10.1109/TCBB.2022.3204661. Epub 2023 Apr 3.
Proteins that bind ribonucleic acid (RNA) inside cells are called RNA-binding proteins (RBPs), and they play a crucial role in gene regulation. Identifying RNA-protein binding sites helps clarify RBP function. Although many computational methods have been developed to predict RNA-protein binding sites, their prediction accuracy on small-sample datasets needs improvement. To overcome this limitation, we propose a novel model called SA-Net, which uses k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features. K-mer embedding helps the model discover significant subsequence fragments associated with binding sites. The self-attention mechanism captures contextual information globally across the entire input sequence and performs well in small-sample sequence learning. Experimental results demonstrate that SA-Net attains state-of-the-art results on the RBP-24 dataset. We find that 4-mer embedding yields the best model performance. We also show that the self-attention network outperforms the commonly used CNN and CNN-BLSTM models in sequence feature extraction.
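The two ideas named in the abstract, k-mer encoding and self-attention, can be illustrated with a minimal sketch. This is not the SA-Net implementation: the abstract does not give the window stride, the embedding dimension, how the embedding is trained, or the attention configuration. The sketch assumes a stride of 1, a random untrained embedding table, and single-head attention with identity projections; all function and variable names are hypothetical.

```python
import math
import random
from itertools import product

ALPHABET = "ACGU"  # RNA nucleotides

def build_vocab(k):
    """Map every possible k-mer over ACGU to an integer id (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}

def kmer_tokens(seq, k):
    """Slide a window of width k over the sequence (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    A trained model would apply learned projections first; they are
    omitted here to keep the sketch short.
    """
    d = len(X[0])
    outputs, weights = [], []
    for query in X:
        # Similarity of this position to every position in the sequence.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in X]
        w = softmax(scores)
        weights.append(w)
        # Each output vector is an attention-weighted mix of all positions.
        outputs.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return outputs, weights

k = 4  # the abstract reports that 4-mers performed best
vocab = build_vocab(k)
seq = "AUGGCUACGUAG"  # made-up example sequence
tokens = kmer_tokens(seq, k)
ids = [vocab[t] for t in tokens]

# Random embedding table as a stand-in for a learned k-mer embedding.
rng = random.Random(0)
dim = 8  # assumed embedding size
embedding = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(len(vocab))]
X = [embedding[i] for i in ids]

context, attn = self_attention(X)
```

Each row of `attn` weights every k-mer position in the sequence, which is the sense in which self-attention captures context globally, in contrast to the fixed local window of a convolution.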