School of Electrical Engineering and Computer Science, Oregon State University, 1148 Kelley Engineering Center, Corvallis, OR 97331, USA.
Department of Biochemistry and Biophysics, Oregon State University, 2011 Ag & Life Sciences Bldg, Corvallis, OR 97331, USA.
Nucleic Acids Res. 2018 Sep 19;46(16):8105-8113. doi: 10.1093/nar/gky567.
The current deluge of newly identified RNA transcripts presents a singular opportunity for improved assessment of coding potential, a cornerstone of genome annotation, and for machine-driven discovery of biological knowledge. While traditional, feature-based methods for RNA classification are limited by current scientific knowledge, deep learning methods can independently discover complex biological rules in the data de novo. We trained a gated recurrent neural network (RNN) on human messenger RNA (mRNA) and long noncoding RNA (lncRNA) sequences. Our model, mRNA RNN (mRNN), surpasses state-of-the-art methods at predicting protein-coding potential despite being trained with less data and with no prior concept of what features define mRNAs. To understand what mRNN learned, we probed the network and uncovered several context-sensitive codons highly predictive of coding potential. Our results suggest that gated RNNs can learn complex and long-range patterns in full-length human transcripts, making them ideal for performing a wide range of difficult classification tasks and, most importantly, for harvesting new biological insights from the rising flood of sequencing data.
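As a rough illustration of the kind of model the abstract describes, the sketch below runs a gated recurrent unit (GRU) over a one-hot encoded transcript and maps the final hidden state to a coding-potential score. It is not the authors' mRNN implementation: the class name `TinyGRUClassifier`, the hidden size, the choice of GRU as the gating scheme, and the random untrained weights are all assumptions made for illustration, so the scores it produces carry no biological meaning.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Encode one nucleotide as a 4-dim one-hot vector (unknown bases map to zeros)."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def affine(W, U, b, x, h):
    """Compute W x + U h + b as a plain list."""
    return [
        sum(w * xi for w, xi in zip(W[i], x))
        + sum(u * hi for u, hi in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


class TinyGRUClassifier:
    """Hypothetical, untrained GRU classifier (illustration only)."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)
        d = len(NUCLEOTIDES)
        self.hidden_size = hidden_size
        # One (W, U, b) triple per gate: update z, reset r, candidate n.
        self.params = {
            gate: (
                rand_matrix(hidden_size, d, rng),
                rand_matrix(hidden_size, hidden_size, rng),
                [0.0] * hidden_size,
            )
            for gate in ("z", "r", "n")
        }
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
        self.b_out = 0.0

    def step(self, x, h):
        """One GRU update: the gates decide how much of the old state to keep."""
        z = [sigmoid(v) for v in affine(*self.params["z"], x, h)]
        r = [sigmoid(v) for v in affine(*self.params["r"], x, h)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in affine(*self.params["n"], x, reset_h)]
        return [(1.0 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

    def coding_probability(self, sequence):
        """Read the whole transcript left to right and score the final state."""
        h = [0.0] * self.hidden_size
        # Treat RNA (U) and cDNA (T) spellings of the same transcript alike.
        for base in sequence.upper().replace("U", "T"):
            h = self.step(one_hot(base), h)
        logit = sum(w * hi for w, hi in zip(self.w_out, h)) + self.b_out
        return sigmoid(logit)


if __name__ == "__main__":
    model = TinyGRUClassifier()
    print(model.coding_probability("AUGGCCAAGUAA"))
```

Because the hidden state is carried across every position, a trained network of this shape can in principle let a nucleotide or codon early in the transcript influence the score at the end, which is the long-range, context-sensitive behavior the abstract refers to. Training (for example, gradient descent on labeled mRNA and lncRNA sequences) is deliberately omitted here.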