Department of Biology, McMaster University, Hamilton, Ontario, Canada, L8S 4K1.
G3 (Bethesda). 2019 Aug 8;9(8):2511-2520. doi: 10.1534/g3.119.400201.
Long non-coding RNAs (lncRNAs) represent a diverse class of regulatory loci with roles in development and stress responses throughout all kingdoms of life. LncRNAs, however, remain under-studied in plants compared to animal systems. To address this deficiency, we applied a machine learning prediction tool, Classifying RNA by Ensemble Machine learning Algorithm (CREMA), to analyze RNAseq data from 11 plant species chosen to represent a wide range of evolutionary histories. Transcript sequences of all expressed and/or annotated loci from plants grown in unstressed (control) conditions were assembled and input into CREMA for comparative analyses. On average, 6.4% of the plant transcripts were identified by CREMA as encoding lncRNAs. Gene annotation associated with the transcripts showed that, for two of the species, up to 99% of all predicted lncRNAs were missing from their reference annotations, whereas the reference annotation for the genetic model plant species contains 96% of its predicted lncRNAs. Thus, lncRNA research in less well-studied plants that relies on reference annotations can be impeded by the near absence of annotations for these regulatory transcripts. Moreover, our phylogenetic signal analyses suggest that the molecular traits of plant lncRNAs show evolutionary patterns that differ from those of all other plant transcripts and do not follow a classic evolutionary pattern. Specifically, GC content was the only tested lncRNA trait with consistently significant and high phylogenetic signal, in contrast to the high signal found in all tested molecular traits of the other transcripts in the tested plant species.
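As an illustration of the one trait the abstract singles out, GC content, the following minimal Python sketch computes the GC fraction of a transcript and a per-species mean for a set of transcripts. It is not the authors' pipeline: the function names `gc_content` and `mean_gc` and the toy sequences are hypothetical, and the choice to ignore ambiguous bases such as N is an assumption made here. Per-species means of this kind are the sort of value a phylogenetic signal test would take as input; the abstract does not state which test statistic was used.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T, U).

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    An empty or fully ambiguous sequence yields 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at
    return gc / total if total else 0.0


def mean_gc(transcripts) -> float:
    """Return the mean GC fraction over an iterable of transcript sequences."""
    values = [gc_content(s) for s in transcripts]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the study.
    predicted_lncrnas = ["ATGCGC", "GGCCAT"]
    other_transcripts = ["ATATAT", "ATGC"]
    print("lncRNA mean GC:", mean_gc(predicted_lncrnas))
    print("other mean GC:", mean_gc(other_transcripts))
```

Computing the trait separately for predicted lncRNAs and for all other transcripts within each species mirrors the comparison described in the abstract, where the two transcript classes are tested independently.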