Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná (UTFPR), Campus Cornélio Procópio, Brazil, and Informatics and Knowledge Management Graduate Program, Universidade Nove de Julho, São Paulo, Brazil.
Informatics and Knowledge Management Graduate Program, Universidade Nove de Julho, São Paulo, Brazil.
Brief Bioinform. 2019 Mar 25;20(2):682-689. doi: 10.1093/bib/bby034.
Long noncoding RNAs (lncRNAs) are a class of eukaryotic noncoding RNA that has gained great attention in recent years as a higher layer of regulation of gene expression in cells. There is, however, a lack of specific computational approaches to reliably predict lncRNAs in plants, which contrasts with the variety of prediction tools available for mammalian lncRNAs. This distinction is not that obvious, given that the biological features and mechanisms generating lncRNAs in the cell are likely different between animals and plants. Considering this, we present a machine learning analysis and a classifier approach called RNAplonc (https://github.com/TatianneNegri/RNAplonc/) to identify lncRNAs in plants.
Our feature selection analysis considered 5468 features and retained only 16 of them to robustly identify lncRNAs with the REPTree algorithm. This was the basis for creating the model and training it with lncRNA and mRNA data from five plant species (thale cress, cucumber, soybean, poplar and Asian rice). After an extensive comparison with other tools widely used in plants (CPC, CPC2, CPAT and PLncPRO), we found that RNAplonc produced more reliable lncRNA predictions from plant transcripts, obtaining the best result in 87.5% of cases across eight tests on eight species from the GreeNC database and four independent studies in monocotyledonous (Brachypodium) and eudicotyledonous (Populus and Gossypium) species.
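To make the idea of sequence-derived features concrete, the following is a minimal Python sketch of how a transcript could be turned into a numeric feature vector before a classifier is trained. The abstract does not name the 16 selected features, so the ones computed here (transcript length, GC content, longest forward-strand ORF and dinucleotide frequencies) are assumptions chosen only for illustration, and all function names are hypothetical. This is not the RNAplonc implementation.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides (stop codon included) of the longest ATG..stop
    ORF found in the three forward-strand reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq: str, k: int = 2) -> dict:
    """Relative frequency of every A/C/G/T k-mer (overlapping windows).
    Windows containing other symbols, such as N, are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def extract_features(seq: str) -> dict:
    """Assemble an illustrative feature vector for one transcript.
    These features are assumptions, not the 16 features used by RNAplonc."""
    features = {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "longest_orf": longest_orf_length(seq),
    }
    for kmer, freq in kmer_frequencies(seq, 2).items():
        features[f"freq_{kmer}"] = freq
    return features


if __name__ == "__main__":
    # Toy transcript, made up for this example.
    for name, value in extract_features("ATGAAATAGCC").items():
        print(name, value)
```

Feature vectors of this kind, computed for labeled lncRNA and mRNA transcripts, could then be passed to a feature selection step and a decision-tree learner such as REPTree (available in Weka), which is the general workflow the abstract describes.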