Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine.

Author information

Mao Rui, Raj Kumar Praveen Kumar, Guo Cheng, Zhang Yang, Liang Chun

Affiliations

College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi, China; College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China; Department of Biology, Miami University, Oxford, Ohio, United States of America.

Department of Biology, Miami University, Oxford, Ohio, United States of America.

Publication information

PLoS One. 2014 Aug 11;9(8):e104049. doi: 10.1371/journal.pone.0104049. eCollection 2014.

Abstract

Alternative splicing is one of the major modes of pre-mRNA post-transcriptional modification. It allows many distinct mature mRNA transcripts to be produced from a single gene by using different splice sites. In plants such as Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and on their expression regulation, while few have systematically classified RIs against constitutively spliced introns (CSIs) using machine learning approaches. We used random forest and a support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. By comparing the coordinates of introns of all annotated mRNAs from TAIR10, we obtained our high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach classified RIs from CSIs more accurately than four other approaches. The optimal penalty parameter C and the RBF kernel parameter of the SVM were set using a particle swarm optimization algorithm (PSOSVM). Our classification achieved an F-Measure of 80.8% (random forest) and 77.4% (PSOSVM). In addition to the basic sequence features and positional distribution characteristics of RIs, our feature extraction approach yielded predictions of putative regulatory motifs in intron splicing. Our study will facilitate a better understanding of the underlying mechanisms involved in intron retention.
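For readers unfamiliar with the two quantities the abstract names, the following minimal sketch shows the standard definitions of an RBF kernel and of the F-Measure in plain Python. It is not the authors' code. The function names, the parameter name `gamma` (the usual name for the RBF kernel parameter), and the toy labels are assumptions made for illustration. The abstract does not say whether its F-Measure is computed for one class or averaged over both; the sketch computes it for a single positive class.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only: 1 = retained intron (RI),
# 0 = constitutively spliced intron (CSI).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
score = f_measure(y_true, y_pred)

# Kernel similarity between two hypothetical 2-feature vectors.
similarity = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger `gamma` makes the kernel value fall off faster with distance, so each training example influences a smaller neighborhood. That sensitivity is why the kernel parameter is usually tuned together with the penalty parameter C, as the abstract reports doing with particle swarm optimization.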
