Dai Qiguo, Guo Maozu, Duan Xiaodong, Teng Zhixia, Fu Yueyue
School of Computer Science and Engineering, Dalian Minzu University, Dalian, China.
Dalian Key Laboratory of Digital Technology for National Culture, Dalian Minzu University, Dalian, China.
Front Genet. 2019 Feb 1;10:18. doi: 10.3389/fgene.2019.00018. eCollection 2019.
Non-coding RNA (ncRNA) plays important roles in many critical regulatory processes. Many ncRNAs perform their regulatory functions in the form of RNA-protein complexes. Therefore, identifying interactions between ncRNAs and proteins is fundamental to understanding ncRNA function. Because experimental techniques are expensive, developing an accurate computational prediction model has become an indispensable way to identify ncRNA-protein interactions. A powerful prediction model of ncRNA-protein interaction needs a good feature set for characterizing the interaction. In this paper, a novel method, named CFRP, is put forward to generate complex features for characterizing ncRNA-protein interaction. To obtain a comprehensive description of the interaction, complex features are generated by non-linear transformations of the traditional k-mer features of ncRNA and protein sequences. To reduce the dimensionality of the complex features, a group of discriminative features is then selected by random forest. To validate the performance of the proposed method, a series of experiments is carried out on several widely used public datasets. Compared with the traditional k-mer features, the CFRP complex features improve the performance of ncRNA-protein interaction prediction models. The CFRP-based prediction model is also compared with several state-of-the-art methods, and the results show that the proposed method performs better than the others in terms of the evaluation metrics. In conclusion, the complex features generated by CFRP are beneficial for building a powerful prediction model of ncRNA-protein interaction.
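The idea of building complex features from k-mer features can be illustrated with a minimal sketch. The abstract does not specify which non-linear transformations CFRP applies, so the pairwise product used here is only an assumed stand-in, not the authors' method. The function names, the alphabets, and the choice of k are likewise hypothetical, and the random forest selection step is not shown.

```python
from itertools import product

# Assumed alphabets for illustration; the abstract does not state the encoding used.
RNA_ALPHABET = "ACGU"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(seq, k, alphabet):
    """Return normalized k-mer frequencies of seq in a fixed k-mer order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing unknown symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def pairwise_products(rna_vec, protein_vec):
    """Hypothetical non-linear transformation: every RNA-feature x protein-feature product."""
    return [r * p for r in rna_vec for p in protein_vec]


# Toy sequences, invented for this example.
rna_features = kmer_frequencies("ACGUACGU", k=2, alphabet=RNA_ALPHABET)
protein_features = kmer_frequencies("MKV", k=1, alphabet=AMINO_ACIDS)
complex_features = pairwise_products(rna_features, protein_features)
print(len(rna_features), len(protein_features), len(complex_features))
```

Because the combined vector grows multiplicatively with the two k-mer vocabularies, a selection step such as the random forest ranking mentioned in the abstract becomes necessary in practice to keep the feature count manageable.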