School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore.
School of Mathematical Sciences, Dalian University of Technology, No.2 Linggong Road, Dalian, China.
Brief Bioinform. 2021 Mar 22;22(2):2073-2084. doi: 10.1093/bib/bbaa039.
The development of deep sequencing technologies has led to the discovery of novel transcripts. Many in silico methods have been developed to assess the coding potential of these transcripts so that their functions can be investigated further. Existing methods perform well at distinguishing the majority of long noncoding RNAs (lncRNAs) from coding RNAs (mRNAs) but poorly on RNAs with small open reading frames (sORFs). Here, we present DeepCPP (deep neural network for coding potential prediction), a deep learning method for RNA coding potential prediction. Extensive evaluations on four previous datasets and six new datasets constructed in different species show that DeepCPP outperforms other state-of-the-art methods, especially on sORF-type data, where it overcomes the bottleneck of sORF mRNA identification by improving accuracy by more than 4.31%, 37.24% and 5.89% on newly discovered human, vertebrate and insect data, respectively. We also show that discontinuous k-mers, together with our newly proposed nucleotide bias and minimal distribution similarity feature selection method, play crucial roles in this classification problem. Taken together, DeepCPP is an effective method for RNA coding potential prediction.
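The abstract does not define the discontinuous k-mer feature, so the following is only a minimal sketch of one common reading: a gapped k-mer, in which selected positions inside a sliding window are ignored. The function name `gapped_kmer_freqs` and the `mask` convention ('1' keeps a position, '0' skips it) are assumptions made for illustration and are not taken from DeepCPP.

```python
from collections import Counter


def gapped_kmer_freqs(seq, mask):
    """Relative frequencies of gapped k-mers in a nucleotide sequence.

    Hypothetical helper, not DeepCPP's implementation.
    mask is a string of '1' (keep this position) and '0' (skip it),
    e.g. "101" pairs each nucleotide with the one two positions later.
    """
    # Normalise case and treat RNA 'U' as 'T' so both alphabets work.
    seq = seq.upper().replace("U", "T")
    keep = [i for i, m in enumerate(mask) if m == "1"]
    span = len(mask)
    counts = Counter()
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        key = "".join(window[i] for i in keep)
        # Ignore windows whose kept positions hold ambiguous bases (e.g. N).
        if set(key) <= set("ACGT"):
            counts[key] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


if __name__ == "__main__":
    # Mask "101" reads positions 0 and 2 of each 3-nucleotide window.
    print(gapped_kmer_freqs("ACGUAC", "101"))
```

A mask made up only of '1' characters reduces this to ordinary contiguous k-mer counting, which makes the two feature types easy to compare on the same sequence.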