Identification of long non-coding transcripts with feature selection: a comparative study.

Author information

Ventola Giovanna M M, Noviello Teresa M R, D'Aniello Salvatore, Spagnuolo Antonietta, Ceccarelli Michele, Cerulo Luigi

Affiliations

Department of Science and Technology, University of Sannio, via Port'Arsa, 11, Benevento, 82100, Italy.

BioGeM, Institute of Genetic Research "Gaetano Salvatore", c.da Camporeale, Ariano Irpino (AV), 83031, Italy.

Publication information

BMC Bioinformatics. 2017 Mar 23;18(1):187. doi: 10.1186/s12859-017-1594-z.

DOI:10.1186/s12859-017-1594-z

PMID:28335739

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC5364679/

Abstract

BACKGROUND

The discovery that long non-coding RNAs are important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs among transcripts assembled from high-throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish coding from non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used successfully in the literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs.
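As an illustration of two of the feature classes named above (open reading frame and nucleotide arrangements), the following is a minimal sketch, not the authors' implementation. The function names `longest_orf_length` and `kmer_frequencies` are hypothetical, the ORF scan covers only the forward strand, and the ATG-to-stop definition of an ORF is an assumption made for the example.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length (nt, stop codon included) of the longest ATG..stop ORF.

    Assumption for this sketch: forward strand only, three reading frames.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of every A/C/G/T k-mer (overlapping windows)."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (for example N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


if __name__ == "__main__":
    transcript = "CCATGAAATAGCC"  # toy sequence, not real data
    print(longest_orf_length(transcript))
    print(kmer_frequencies(transcript, k=2)["AT"])
```

Features of this kind would form the columns of the matrix on which feature selection is later run; conservation scores and secondary structure require external data or tools and are not sketched here.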

RESULTS

In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Compared with coding potential tools and similar supervised approaches, a machine learning algorithm that includes novel signatures, such as those identified here, improves prediction performance, measured as area under the precision-recall curve, by 1 to 24%, depending on the species and on the signature.
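The two evaluation ideas mentioned above, signature stability and area under the precision-recall curve, can be sketched as follows. This is a hedged illustration only: the abstract does not state which stability measure was used, so the mean pairwise Jaccard index is an assumption, and average precision is used here as a common step-wise estimate of the area under the precision-recall curve. The function names and feature labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Jaccard index of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def signature_stability(signatures) -> float:
    """Mean pairwise Jaccard index over signatures selected on resamples.

    Assumption: this is one common stability measure, not necessarily
    the one used in the study.
    """
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def average_precision(labels, scores) -> float:
    """Average precision: mean of precision at the rank of each positive.

    labels are 1 (positive, e.g. lncRNA) or 0; ties in scores are not
    treated specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


if __name__ == "__main__":
    # Hypothetical signatures from three resampling runs.
    runs = [{"orf", "gc"}, {"orf", "kmer"}, {"orf", "gc"}]
    print(signature_stability(runs))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A stability value near 1 means the selection algorithm picks nearly the same features on every resample; a value near 0 means the signature depends heavily on the particular training sample.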

CONCLUSIONS

Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines, which is especially relevant for poorly annotated genomes such as zebrafish. We provide a web tool that uses the obtained signatures to recognize novel long non-coding RNAs from input in FASTA and GTF formats. The tool is available at the following URL: http://www.bioinformatics-sannio.org/software/ .

Similar articles

A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts.

BMC Genomics. 2017 Oct 18;18(1):804. doi: 10.1186/s12864-017-4178-4.

Prediction of plant lncRNA by ensemble machine learning classifiers.

BMC Genomics. 2018 May 2;19(1):316. doi: 10.1186/s12864-018-4665-2.

lncRNA-MFDL: identification of human long non-coding RNAs by fusing multiple features and using deep learning.

Mol Biosyst. 2015 Mar;11(3):892-7. doi: 10.1039/c4mb00650j. Epub 2015 Jan 15.

PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme.

BMC Bioinformatics. 2014 Sep 19;15(1):311. doi: 10.1186/1471-2105-15-311.

LncRNA-ID: Long non-coding RNA IDentification using balanced random forests.

Bioinformatics. 2015 Dec 15;31(24):3897-905. doi: 10.1093/bioinformatics/btv480. Epub 2015 Aug 26.

PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets.

Comput Biol Med. 2019 Feb;105:169-181. doi: 10.1016/j.compbiomed.2018.12.014. Epub 2019 Jan 4.

An update on LNCipedia: a database for annotated human lncRNA sequences.

Nucleic Acids Res. 2015 Jan;43(Database issue):D174-80. doi: 10.1093/nar/gku1060. Epub 2014 Nov 5.

PredcircRNA: computational classification of circular RNA from other long non-coding RNA using hybrid features.

Mol Biosyst. 2015 Aug;11(8):2219-26. doi: 10.1039/c5mb00214a.

Discriminating cirRNAs from other lncRNAs using a hierarchical extreme learning machine (H-ELM) algorithm with feature selection.

Mol Genet Genomics. 2018 Feb;293(1):137-149. doi: 10.1007/s00438-017-1372-7. Epub 2017 Sep 14.

Cited by

Deep learning tools are top performers in long non-coding RNA prediction.

Brief Funct Genomics. 2022 May 21;21(3):230-241. doi: 10.1093/bfgp/elab045.

Epigenetic Regulation of the Vascular Endothelium by Angiogenic LncRNAs.

Front Genet. 2021 Aug 26;12:668313. doi: 10.3389/fgene.2021.668313. eCollection 2021.

A systematic evaluation of bioinformatics tools for identification of long noncoding RNAs.

RNA. 2021 Jan;27(1):80-98. doi: 10.1261/rna.074724.120. Epub 2020 Oct 14.

A systematic review of the application of machine learning in the detection and classification of transposable elements.

PeerJ. 2019 Dec 18;7:e8311. doi: 10.7717/peerj.8311. eCollection 2019.

PredLnc-GFStack: A Global Sequence Feature Based on a Stacked Ensemble Learning Method for Predicting lncRNAs from Transcripts.

Genes (Basel). 2019 Sep 3;10(9):672. doi: 10.3390/genes10090672.

A Hybrid Prediction Method for Plant lncRNA-Protein Interaction.

Cells. 2019 May 30;8(6):521. doi: 10.3390/cells8060521.

IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection.

Bioinformatics. 2018 Sep 1;34(17):i620-i628. doi: 10.1093/bioinformatics/bty572.

Detection of long non-coding RNA homology, a comparative study on alignment and alignment-free metrics.

BMC Bioinformatics. 2018 Nov 6;19(1):407. doi: 10.1186/s12859-018-2441-6.

LncRNAs in vascular biology and disease.

Vascul Pharmacol. 2019 Mar;114:145-156. doi: 10.1016/j.vph.2018.01.003. Epub 2018 Feb 6.

A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts.

BMC Genomics. 2017 Oct 18;18(1):804. doi: 10.1186/s12864-017-4178-4.

References

Computational recognition for long non-coding RNA (lncRNA): Software and databases.

Brief Bioinform. 2017 Jan;18(1):9-27. doi: 10.1093/bib/bbv114. Epub 2016 Feb 2.

Many lncRNAs, 5'UTRs, and pseudogenes are translated and some are likely to express functional proteins.

Elife. 2015 Dec 19;4:e08890. doi: 10.7554/eLife.08890.

Annocript: a flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding RNAs.

Bioinformatics. 2015 Jul 1;31(13):2199-201. doi: 10.1093/bioinformatics/btv106. Epub 2015 Feb 19.

lncRNA-MFDL: identification of human long non-coding RNAs by fusing multiple features and using deep learning.

Mol Biosyst. 2015 Mar;11(3):892-7. doi: 10.1039/c4mb00650j. Epub 2015 Jan 15.

Comparative analysis of transposable elements highlights mobilome diversity and evolution in vertebrates.

Genome Biol Evol. 2015 Jan 9;7(2):567-80. doi: 10.1093/gbe/evv005.

Identification and functional analysis of long non-coding RNAs in mouse cleavage stage embryonic development based on single cell transcriptome data.

BMC Genomics. 2014 Oct 3;15(1):845. doi: 10.1186/1471-2164-15-845.

Natural variability of Kozak sequences correlates with function in a zebrafish model.

PLoS One. 2014 Sep 23;9(9):e108475. doi: 10.1371/journal.pone.0108475. eCollection 2014.

PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme.

BMC Bioinformatics. 2014 Sep 19;15(1):311. doi: 10.1186/1471-2105-15-311.

Long non-coding RNAs as a source of new peptides.

Elife. 2014 Sep 16;3:e03523. doi: 10.7554/eLife.03523.

The RIDL hypothesis: transposable elements as functional domains of long noncoding RNAs.

RNA. 2014 Jul;20(7):959-76. doi: 10.1261/rna.044560.114. Epub 2014 May 21.
