LncRNA-ID: Long non-coding RNA IDentification using balanced random forests.

Authors

Achawanantakun Rujira, Chen Jiao, Sun Yanni, Zhang Yuan

Affiliation

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.

Publication information

Bioinformatics. 2015 Dec 15;31(24):3897-905. doi: 10.1093/bioinformatics/btv480. Epub 2015 Aug 26.

DOI:10.1093/bioinformatics/btv480

Abstract

MOTIVATION

Long non-coding RNAs (lncRNAs), which are non-coding RNAs longer than 200 nucleotides, perform important biological functions such as regulating gene expression. To fully reveal the functions of lncRNAs, a fundamental step is to annotate them in various species. However, because lncRNAs tend to contain one or more open reading frames, it is not trivial to distinguish these long non-coding transcripts from protein-coding genes in transcriptomic data.
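
To illustrate why the presence of an open reading frame (ORF) alone does not settle whether a transcript is coding, the following is a minimal sketch of a forward-strand ORF scan that also derives two simple sequence features. The function names `longest_orf` and `orf_features`, the ATG-to-stop ORF definition, and the two features shown are assumptions made for this example; they are not taken from the LncRNA-ID scripts.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF on the forward
    strand, or None if no complete ORF exists. `end` is exclusive and
    includes the stop codon. (Hypothetical helper for illustration.)"""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def orf_features(seq):
    """Compute illustrative ORF features: length in nucleotides and the
    fraction of the transcript covered by the longest ORF."""
    orf = longest_orf(seq)
    length = 0 if orf is None else orf[1] - orf[0]
    coverage = length / len(seq) if seq else 0.0
    return {"orf_length": length, "orf_coverage": coverage}
```

A short transcript such as `"CCATGAAATTTTAGGG"` contains a complete ORF even though nothing else suggests it is protein-coding, which is why additional evidence is needed beyond ORF detection.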

RESULTS

In this work, we design a new tool that calculates the coding potential of a transcript using a machine learning model (random forest) based on multiple features including sequence characteristics of putative open reading frames, translation scores based on ribosomal coverage, and conservation against characterized protein families. The experimental results show that our tool competes favorably with existing coding potential computation tools in lncRNA identification.
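
The title refers to balanced random forests. In the general balanced random forest technique, each tree is trained on a bootstrap sample that contains the same number of examples from every class, so that the majority class does not dominate. The sketch below shows only that sampling step in plain Python; the function name `balanced_bootstrap` and the class labels are assumptions for illustration, and the authors' actual training procedure and parameters should be taken from the paper and repository.

```python
import random


def balanced_bootstrap(labels, rng):
    """Draw one bootstrap sample of indices with equal counts per class.

    Each class contributes as many indices (sampled with replacement) as
    the smallest class has members. (Hypothetical helper for illustration.)
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_minority = min(len(ids) for ids in by_class.values())
    sample = []
    for ids in by_class.values():
        sample.extend(rng.choices(ids, k=n_minority))
    return sample


# Example: an imbalanced training set with many more coding transcripts.
labels = ["coding"] * 90 + ["lncRNA"] * 10
sample = balanced_bootstrap(labels, random.Random(0))
```

Each tree in the forest would receive its own sample drawn this way, so every tree sees both classes in equal proportion regardless of how skewed the full training set is.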

AVAILABILITY AND IMPLEMENTATION

The scripts and data can be downloaded at https://github.com/zhangy72/LncRNA-ID.
