Meng Jun, Kang Qiang, Chang Zheng, Luan Yushi
School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China.
School of Bioengineering, Dalian University of Technology, Dalian, 116024, Liaoning, China.
BMC Bioinformatics. 2021 May 12;22(Suppl 3):242. doi: 10.1186/s12859-020-03870-2.
Long noncoding RNAs (lncRNAs) play an important role in regulating biological activities, and predicting them is important for exploring biological processes. Long short-term memory (LSTM) networks and convolutional neural networks (CNNs) can automatically extract and learn abstract information from encoded RNA sequences, which avoids complex feature engineering. An ensemble model learns information from multiple perspectives and performs better than a single model. It is therefore feasible and interesting to treat an RNA sequence as a sentence and as an image in order to train an LSTM and a CNN, respectively, and then to hybridize the trained models to predict lncRNAs. To date, various lncRNA predictors exist, but few of them have been proposed for plants. A reliable and powerful predictor for plant lncRNAs is needed.
To improve lncRNA prediction, this paper proposes a hybrid deep learning model based on two encoding styles (PlncRNA-HDeep). The model requires no prior knowledge and uses only RNA sequences for training to predict plant lncRNAs. It learns diversified information from RNA sequences encoded by p-nucleotide and one-hot encodings, and it also takes advantage of both the lncRNA-LSTM proposed in our previous study and a CNN. The parameters are tuned and three hybrid strategies are tested to maximize its performance. Experimental results show that PlncRNA-HDeep is more effective than lncRNA-LSTM and CNN alone and obtains 97.9% sensitivity, 95.1% precision, 96.5% accuracy and a 96.5% F1 score on the Zea mays dataset. These results are better than those of several shallow machine learning methods (support vector machine, random forest, k-nearest neighbor, decision tree, naive Bayes and logistic regression) and some existing tools (CNCI, PLEK, CPC2, LncADeep and lncRNAnet).
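The two encoding styles named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that p-nucleotide encoding splits a sequence into words of p consecutive nucleotides and maps each word to an integer index (the "sentence" view for an LSTM), and that one-hot encoding maps each nucleotide to a 4-element binary vector (the "image" view for a CNN). The function names, the default p = 3, the overlapping stride of 1, the T-to-U normalization, and the use of index 0 for unknown words are all illustrative assumptions.

```python
from itertools import product
from typing import Dict, List

NUCLEOTIDES = "ACGU"  # assumed alphabet; DNA-style "T" is normalized to "U"


def normalize(seq: str) -> str:
    """Upper-case the sequence and replace T with U."""
    return seq.upper().replace("T", "U")


def p_nucleotide_tokens(seq: str, p: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into words of p consecutive nucleotides.

    Overlapping windows (stride=1) are an assumption of this sketch.
    """
    seq = normalize(seq)
    return [seq[i:i + p] for i in range(0, len(seq) - p + 1, stride)]


def build_vocabulary(p: int = 3) -> Dict[str, int]:
    """Map each of the 4**p possible words to an index starting at 1.

    Index 0 is left free for padding or unknown words (hypothetical choice).
    """
    words = ("".join(w) for w in product(NUCLEOTIDES, repeat=p))
    return {word: i + 1 for i, word in enumerate(words)}


def encode_p_nucleotide(seq: str, p: int = 3) -> List[int]:
    """Integer-encode a sequence as a 'sentence' of p-nucleotide words."""
    vocab = build_vocabulary(p)
    return [vocab.get(tok, 0) for tok in p_nucleotide_tokens(seq, p)]


def encode_one_hot(seq: str) -> List[List[int]]:
    """One-hot encode a sequence as a (length x 4) binary matrix.

    Characters outside the alphabet (for example N) become all-zero rows.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES]
            for base in normalize(seq)]


if __name__ == "__main__":
    example = "ATGCA"  # toy sequence, not taken from any dataset
    print(p_nucleotide_tokens(example))
    print(encode_p_nucleotide(example))
    print(encode_one_hot(example))
```

In practice the integer sequences would be padded or truncated to a fixed length before being fed to an LSTM, and the one-hot matrices would be shaped to the CNN's input size; those steps depend on parameters the abstract does not give.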
PlncRNA-HDeep is feasible and obtains credible predictive results. It may also provide a valuable reference for other related research.
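As a quick consistency check on the metrics reported for the Zea mays dataset, the sketch below uses the standard definitions of sensitivity, precision, accuracy and F1 score. The confusion-matrix counts in the demo are hypothetical and only show how the functions are called; the only figures taken from the abstract are the 97.9% sensitivity and 95.1% precision.

```python
from typing import Dict


def f1_score(precision: float, sensitivity: float) -> float:
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true lncRNAs recovered
    precision = tp / (tp + fp)     # fraction of predicted lncRNAs that are correct
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1_score(precision, sensitivity),
    }


if __name__ == "__main__":
    # Reported values: precision 95.1%, sensitivity 97.9%.
    print(round(f1_score(0.951, 0.979), 3))

    # Hypothetical counts, for illustration only.
    print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
```

With the reported precision and sensitivity, the harmonic mean comes to about 0.965, which agrees with the 96.5% F1 score stated in the abstract.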