Chen Lei, Zhang Yu-Hang, Huang Guohua, Pan Xiaoyong, Wang ShaoPeng, Huang Tao, Cai Yu-Dong
College of Life Science, Shanghai University, Shanghai, 200444, People's Republic of China.
College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
Mol Genet Genomics. 2018 Feb;293(1):137-149. doi: 10.1007/s00438-017-1372-7. Epub 2017 Sep 14.
As non-coding RNAs, circular RNAs (cirRNAs) and long non-coding RNAs (lncRNAs) have attracted increasing attention. They have been confirmed to participate in many biological processes, including transcriptional regulation, regulation of protein-coding genes, and binding to RNA-associated proteins. However, the differences between these two types of non-coding RNAs have not been fully uncovered, and it remains difficult to distinguish cirRNAs from other lncRNAs using simple techniques. In this study, we investigated these two types of non-coding RNAs using several computational methods. The purpose was to extract important factors that could distinguish cirRNAs from other lncRNAs and to build an effective classification model. First, we collected cirRNAs, lncRNAs and their representations from a previous study, in which each cirRNA or lncRNA was represented by 188 features derived from its graph representation, sequence and conservation properties. Second, these features were analyzed by the minimum redundancy maximum relevance (mRMR) method. The resulting mRMR feature list, the incremental feature selection method and the hierarchical extreme learning machine algorithm were then used to build an optimal classification model with a sensitivity of 0.703, a specificity of 0.850, an accuracy of 0.789 and a Matthews correlation coefficient of 0.561. Finally, we analyzed the 16 most important features. Among them, features describing the sequence and structure of the RNA molecule ranked highest, implying that they are potential indicators of the differences between cirRNAs and other lncRNAs. Features describing evolutionary conservation and sequence consecution were also important.
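To make the reported measures and the selection step concrete, the following is a minimal Python sketch, not the authors' code. It assumes the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC) from confusion-matrix counts, and it treats incremental feature selection as scoring the top-k prefixes of a ranked feature list. The function names and the `score_subset` callback are hypothetical; in the study the score would come from evaluating the hierarchical extreme learning machine on each feature subset, which is not reproduced here.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary-classification measures from confusion-matrix counts."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


def incremental_feature_selection(
    ranked_features: Sequence[str],
    score_subset: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Score the top-1, top-2, ... top-N ranked features and keep the best prefix.

    `ranked_features` is assumed to be ordered by importance (for example,
    an mRMR ranking). `score_subset` is a hypothetical callback that returns
    a quality score, such as a cross-validated MCC, for a feature subset.
    Ties are resolved in favour of the smaller subset.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        score = score_subset(ranked_features[:k])
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked_features[:best_k]), best_score


if __name__ == "__main__":
    # Illustrative counts only; they are not taken from the study.
    print(confusion_metrics(tp=70, fn=30, tn=85, fp=15))

    # Toy scores standing in for classifier performance per subset size.
    toy_scores = {1: 0.2, 2: 0.5, 3: 0.4, 4: 0.5}
    features = ["f1", "f2", "f3", "f4"]
    print(incremental_feature_selection(features, lambda s: toy_scores[len(s)]))
```

Scoring prefixes of a fixed ranking keeps the search linear in the number of features, at the cost of never revisiting the ranking once it has been computed.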