School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China.
Genes (Basel). 2019 Apr 3;10(4):273. doi: 10.3390/genes10040273.
With the rapid development of high-throughput sequencing technology, a large number of transcript sequences have been discovered, and identifying long non-coding RNAs (lncRNAs) among these transcripts remains a challenging task. Identifying and cataloguing lncRNAs not only helps us understand biological processes more clearly, but also supports the study of disease at the molecular level. At present, lncRNAs are detected by two main approaches: computational and experimental. Owing to the limitations of sequencing technology and unavoidable errors in the sequencing process, the performance of these methods is not fully satisfactory. In this paper, we construct a deep-learning model to distinguish lncRNAs from mRNAs. We use k-mer embedding vectors obtained by training the GloVe algorithm as input features, and build a deep learning framework comprising a bidirectional long short-term memory (BLSTM) layer and a convolutional neural network (CNN) layer with three additional hidden layers. In testing, the model achieved best values of 97.9%, 96.4% and 99.0% for F1 score, accuracy and auROC, respectively, outperforming the traditional PLEK, CNCI and CPC methods for identifying lncRNAs. We hope that our model will help in distinguishing mature mRNAs from lncRNAs and become a potential tool for understanding and detecting lncRNA-associated diseases.
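As a rough illustration of the input step described in the abstract, the sketch below splits a transcript into overlapping k-mers and maps each one to an integer index that an embedding table (such as one trained with GloVe) could be looked up by. The abstract does not state the value of k, the stride, or how the vocabulary is built, so `k=3`, `stride=1`, the index-0 fallback, and the helper names `kmer_tokens`, `build_vocab` and `encode` are all assumptions made for this example, not the authors' implementation.

```python
from itertools import product


def kmer_tokens(seq, k=3, stride=1):
    """Split a nucleotide sequence into overlapping k-mers (assumed k and stride)."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer to an index; 0 is reserved for padding/unknown."""
    return {"".join(p): i + 1 for i, p in enumerate(product(alphabet, repeat=k))}


def encode(seq, vocab, k=3):
    """Turn a sequence into a list of k-mer indices (0 for k-mers outside the vocabulary)."""
    return [vocab.get(tok, 0) for tok in kmer_tokens(seq, k)]


vocab = build_vocab(3)
print(kmer_tokens("ATGGCA"))
print(encode("ATGGCA", vocab))
```

The resulting index lists would typically be padded to a common length and passed through an embedding layer before the BLSTM and CNN layers; that downstream part is omitted here because the abstract gives no layer sizes.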