Mao Rui, Liang Chun, Zhang Yang, Hao Xingan, Li Jinyan
College of Information Engineering, Northwest A&F University, Yangling, China.
Department of Biology, Miami University, Oxford, OH, United States.
Front Plant Sci. 2017 Oct 9;8:1728. doi: 10.3389/fpls.2017.01728. eCollection 2017.
Intron retention, one of the most prevalent alternative splicing events in plants, can leave introns in mature mRNAs. However, the features that distinguish retained introns (RIs) from constitutively spliced introns (CSIs) are still poorly understood. This work proposes a computational pipeline to discover novel RIs from multiple next-generation RNA sequencing (RNA-Seq) datasets of Arabidopsis thaliana. Using this pipeline, we detected 3,472 novel RIs from 18 RNA-Seq datasets and re-confirmed 1,384 RIs that are currently annotated in the TAIR10 database. We also use the expression of intron-containing isoforms as a new feature in addition to the conventional features. Based on these features, RIs are highly distinguishable from CSIs by machine learning methods, especially when the expressional odds of retention (i.e., the expression ratio of the RI-containing isoforms relative to the isoforms without RIs for the same gene) reaches or exceeds 50/50. In this case, Random Forest clearly separates RIs from CSIs, with an AUC (area under the receiver operating characteristic curve) of 0.95. Characteristics closely associated with RIs include low splice-site strength, high similarity to the flanking exon sequences, a low occurrence percentage of YTRAY near the acceptor site, the presence of putative intronic splicing silencers (ISSs, i.e., AG/GA-rich motifs) and intronic splicing enhancers (ISEs, i.e., TTTT-containing motifs), and enrichment of Serine/Arginine-Rich (SR) proteins and heterogeneous nuclear ribonucleoproteins (hnRNPs).
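The expressional odds of retention defined in the abstract can be illustrated with a minimal Python sketch. It assumes that expression values for the RI-containing isoforms and for the isoforms without the RI have already been summed per gene; the function names, the intron labels, the numbers, and the handling of zero expression are all hypothetical and are not taken from the paper's pipeline. A 50/50 ratio corresponds to odds of 1.0.

```python
def retention_odds(ri_expr, non_ri_expr):
    """Expression of RI-containing isoforms divided by expression of
    isoforms without the RI, for the same gene."""
    if non_ri_expr == 0:
        # Assumption: only retained isoforms expressed -> infinite odds;
        # no expression at all -> undefined (NaN).
        return float("inf") if ri_expr > 0 else float("nan")
    return ri_expr / non_ri_expr


def passes_threshold(ri_expr, non_ri_expr, min_odds=1.0):
    """True when the odds reach or exceed min_odds (1.0 means 50/50)."""
    # A NaN comparison is False, so unexpressed genes are excluded.
    return retention_odds(ri_expr, non_ri_expr) >= min_odds


# Hypothetical values: (RI-containing expression, non-RI expression)
introns = {
    "intron_a": (12.0, 8.0),  # odds 1.5
    "intron_b": (3.0, 9.0),   # odds below 1
    "intron_c": (5.0, 5.0),   # odds exactly 1.0 (50/50)
}

selected = [name for name, (ri, non_ri) in introns.items()
            if passes_threshold(ri, non_ri)]
print(selected)
```

Under these assumptions, only introns whose retained isoforms are expressed at least as strongly as the spliced isoforms would be kept for the high-odds comparison against CSIs.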