Camargo Antonio P, Sourkov Vsevolod, Pereira Gonçalo A G, Carazzolle Marcelo F
Department of Genetics, Evolution, Microbiology and Immunology, Institute of Biology, University of Campinas, Campinas, SP, 13083-862, Brazil.
Department of Computer Science, ReDNA Labs, Pattaya, Chonburi, 20150, Thailand.
NAR Genom Bioinform. 2020 Jan 13;2(1):lqz024. doi: 10.1093/nargab/lqz024. eCollection 2020 Mar.
The advent of high-throughput sequencing technologies has made it possible to obtain large volumes of genetic information quickly and inexpensively. As a result, many efforts are devoted to unveiling the biological roles of genomic elements, with the distinction between protein-coding and long non-coding RNAs being one of the most important tasks. We describe RNAsamba, a tool that predicts the coding potential of RNA molecules from sequence information using a neural network-based model of both the whole sequence and the ORF, which identifies patterns that distinguish coding from non-coding transcripts. We evaluated RNAsamba's classification performance using transcripts from humans and several other model organisms and show that it consistently outperforms other state-of-the-art methods. Our results also show that RNAsamba can identify coding signals in partial-length ORFs and UTR sequences, indicating that its algorithm does not depend on complete transcript sequences. Furthermore, RNAsamba can also predict small ORFs, which are traditionally identified with ribosome profiling experiments. We believe that RNAsamba will enable faster and more accurate biological findings from genomic data of species that are being sequenced for the first time. A user-friendly web interface, documentation with instructions for local installation and usage, and the source code of RNAsamba can be found at https://rnasamba.lge.ibi.unicamp.br/.
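To make concrete what "the ORF" of a transcript refers to, the following minimal Python sketch finds the longest open reading frame on the forward strand of a sequence. It is an illustration only and is not taken from RNAsamba: the function name `longest_orf`, the requirement of an ATG start codon, and the use of the three standard stop codons are assumptions made for this example. In particular, it only reports complete ORFs, whereas the abstract states that RNAsamba also handles partial-length ORFs.

```python
# Hypothetical helper for illustration; not part of RNAsamba.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF on the forward strand, or '' if none."""
    # Accept RNA or DNA input by normalising to an upper-case DNA alphabet.
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATGA"))
```

A classifier along the lines described in the abstract would take both the full transcript and a candidate ORF such as this one as inputs; how RNAsamba actually selects and encodes the ORF is described in its documentation, not here.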