College of Computer Science and Technology, Jilin University, Changchun, China.
BMC Bioinformatics. 2021 Sep 20;22(1):447. doi: 10.1186/s12859-021-04365-4.
Studies have shown that non-coding RNAs (ncRNAs) of the same family have similar functions, so predicting the family of an ncRNA helps in studying its function. Existing computational methods fall into two main categories: the first predicts the ncRNA family by learning features of the sequence or secondary structure, and the second predicts it by aligning homologous sequences. Within the first category, some methods learn features of a predicted secondary structure, and inaccuracy in that predicted structure may lower their accuracy. In contrast, ncRFP predicts the ncRNA family by learning features directly from the ncRNA sequence. Although ncRFP simplifies the prediction process and improves performance, it leaves room for improvement because the features in its input data are incomplete. In the second category, the homologous sequence alignment method currently achieves the highest performance. However, its use is limited because it requires consensus secondary structure annotation of the ncRNA sequences and cannot model pseudoknots.
In this paper, a novel method, "ncDLRES", is proposed. It predicts the family of ncRNAs by learning sequence features and is based on Dynamic LSTM (Long Short-Term Memory) and ResNet (Residual Neural Network).
ncDLRES extracts features from ncRNA sequences with a Dynamic LSTM and then classifies them with a ResNet. Compared with the homologous sequence alignment method, ncDLRES requires less data and has a wider scope of application. Compared with methods of the first category, ncDLRES achieves greatly improved performance.
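As an illustration of the kind of input a Dynamic LSTM works with, the following minimal sketch encodes variable-length ncRNA sequences as one-hot vectors and pads each batch only to its own longest sequence, keeping the true lengths so that a recurrent layer could stop at the real end of each sequence. The abstract does not state how ncDLRES encodes its input, so the four-letter alphabet, the all-zero encoding of other symbols, and the function names `one_hot` and `make_batch` are assumptions made for this example, not the authors' implementation.

```python
# Hypothetical sketch: names and encoding choices are assumptions,
# not taken from the ncDLRES paper.
ALPHABET = "ACGU"  # assumed alphabet; any other symbol becomes an all-zero vector


def one_hot(seq):
    """Encode an RNA string as a list of 4-dimensional one-hot vectors."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style T as U
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq]


def make_batch(seqs):
    """Pad sequences to the longest one in this batch and record true lengths."""
    encoded = [one_hot(s) for s in seqs]
    lengths = [len(e) for e in encoded]
    max_len = max(lengths)
    padded = [
        e + [[0.0] * len(ALPHABET) for _ in range(max_len - len(e))]
        for e in encoded
    ]
    return padded, lengths


batch, lengths = make_batch(["GGCUA", "AUG", "ACGUACGU"])
print(lengths)  # true lengths, to be passed alongside the padded batch
print(len(batch), len(batch[0]), len(batch[0][0]))  # batch size, steps, features
```

In a deep learning framework, the padded batch and the `lengths` list would typically be handed to the recurrent layer together, so that padding positions do not contribute to the extracted features.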