RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences.

Author information

Camargo Antonio P, Sourkov Vsevolod, Pereira Gonçalo A G, Carazzolle Marcelo F

Affiliations

Department of Genetics, Evolution, Microbiology and Immunology, Institute of Biology, University of Campinas, Campinas, SP, 13083-862, Brazil.

Department of Computer Science, ReDNA Labs, Pattaya, Chonburi, 20150, Thailand.

Publication information

NAR Genom Bioinform. 2020 Jan 13;2(1):lqz024. doi: 10.1093/nargab/lqz024. eCollection 2020 Mar.

DOI:10.1093/nargab/lqz024

PMID:33575571

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7671399/

Abstract

The advent of high-throughput sequencing technologies has made it possible to obtain large volumes of genetic information quickly and inexpensively. As a result, many efforts are devoted to unveiling the biological roles of genomic elements, and distinguishing protein-coding RNAs from long non-coding RNAs is one of the most important of these tasks. We describe RNAsamba, a tool that predicts the coding potential of RNA molecules from sequence information using a neural network-based approach that models both the whole sequence and the ORF to identify patterns that distinguish coding from non-coding transcripts. We evaluated RNAsamba's classification performance on transcripts from humans and several other model organisms and show that it consistently outperforms other state-of-the-art methods. Our results also show that RNAsamba can identify coding signals in partial-length ORFs and UTR sequences, indicating that its algorithm does not depend on complete transcript sequences. Furthermore, RNAsamba can predict small ORFs, which are traditionally identified with ribosome profiling experiments. We believe that RNAsamba will enable faster and more accurate biological findings from genomic data of species that are being sequenced for the first time. A user-friendly web interface, documentation with instructions for local installation and usage, and the source code of RNAsamba can be found at https://rnasamba.lge.ibi.unicamp.br/.
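To make the "whole sequence and the ORF" idea concrete, the sketch below shows one common way to locate the longest open reading frame in a transcript. It is only an illustration under stated assumptions: the function name `longest_orf` is hypothetical, only the forward strand is scanned, and only complete ATG-to-stop ORFs are considered. It is not RNAsamba's own ORF-handling code, which also copes with partial-length ORFs.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> tuple[int, int]:
    """Return (start, end) of the longest complete ORF on the forward strand.

    `end` is exclusive and includes the stop codon. Returns (0, 0) when no
    complete ATG..stop ORF exists. Hypothetical helper for illustration only.
    """
    # Accept RNA or DNA alphabets by normalising U to T.
    seq = seq.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        # Walk the reading frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


if __name__ == "__main__":
    transcript = "CCAUGGCUUAAGG"
    s, e = longest_orf(transcript)
    print(s, e, transcript[s:e])
```

A classifier in this style would then receive both the full transcript and the extracted ORF region as inputs; how those two views are encoded and combined is specific to each tool.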

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ffa/7671399/d73dea6782ad/lqz024fig1.jpg
