LncRNA-ID: Long non-coding RNA IDentification using balanced random forests.

Authors

Achawanantakun Rujira, Chen Jiao, Sun Yanni, Zhang Yuan

Affiliation

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.

Publication information

Bioinformatics. 2015 Dec 15;31(24):3897-905. doi: 10.1093/bioinformatics/btv480. Epub 2015 Aug 26.

Abstract

MOTIVATION

Long non-coding RNAs (lncRNAs), which are non-coding RNAs longer than 200 nucleotides, perform important biological functions such as regulating gene expression. To fully reveal the functions of lncRNAs, a fundamental step is to annotate them in various species. However, because lncRNAs tend to contain one or more open reading frames, it is not trivial to distinguish these long non-coding transcripts from protein-coding genes in transcriptomic data.

RESULTS

In this work, we design a new tool that calculates the coding potential of a transcript using a machine learning model (random forest). The model draws on multiple features, including sequence characteristics of putative open reading frames, translation scores based on ribosomal coverage, and conservation relative to characterized protein families. The experimental results show that our tool competes favorably with existing coding potential computation tools in lncRNA identification.
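To make the first feature group more concrete, the following is a minimal sketch of how two simple characteristics of a putative open reading frame (its length and the fraction of the transcript it covers) might be computed. It is an illustration only and is not taken from the LncRNA-ID scripts: the function names `longest_orf` and `orf_features`, the ATG-to-stop ORF definition, and the restriction to the forward strand are all assumptions made here. The actual tool may define and score ORFs differently.

```python
# Illustrative sketch only; names and ORF definition are assumptions,
# not the LncRNA-ID implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF on the forward
    strand, with end exclusive and including the stop codon, or None."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = None
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def orf_features(seq):
    """Compute two hypothetical ORF features for one transcript."""
    orf = longest_orf(seq)
    if orf is None:
        return {"orf_length": 0, "orf_coverage": 0.0}
    length = orf[1] - orf[0]
    return {"orf_length": length, "orf_coverage": length / len(seq)}
```

Feature dictionaries of this kind, one per transcript, could then be combined with the other feature groups named in the abstract and passed to a random forest classifier. That step is omitted here because the abstract does not specify it.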

AVAILABILITY AND IMPLEMENTATION

The scripts and data can be downloaded at https://github.com/zhangy72/LncRNA-ID.
