Liu Yuan, Chen Dasheng, Su Ran, Chen Wei, Wei Leyi
College of Intelligence and Computing, Tianjin University, Tianjin, China.
Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan, China.
Front Bioeng Biotechnol. 2020 Mar 31;8:227. doi: 10.3389/fbioe.2020.00227. eCollection 2020.
RNA 5-hydroxymethylcytosine (5hmC) modification plays an important role in a series of biological processes. Characterizing its distribution in the transcriptome is fundamentally important for revealing the biological functions of 5hmC. Sequencing-based technologies allow the high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, and expensive. Thus, there is an urgent need to develop more effective and efficient computational methods, at least as a complement to the high-throughput technologies. In this study, we developed iRNA5hmC, a computational predictive protocol to identify RNA 5hmC sites using machine learning. In this predictor, we introduced a sequence-based feature algorithm consisting of two feature representations, (1) k-mer spectrum and (2) positional nucleotide binary vector, to capture the sequential characteristics of 5hmC sites. Afterward, we used a two-stage feature space optimization strategy to improve the feature representation ability, and trained a predictive model using a support vector machine (SVM). Our feature analysis results showed that feature optimization helps to capture the most discriminative features. Compared with well-known existing feature descriptors, our proposed representations can more accurately separate true 5hmC sites from non-5hmC sites. To the best of our knowledge, iRNA5hmC is the first RNA 5hmC predictor that makes predictions based on RNA primary sequences only, without any need for prior experimental knowledge. Importantly, we have established an easy-to-use webserver, which is currently available at http://server.malab.cn/iRNA5hmC. We expect it to be a useful tool for the prediction of 5hmC sites.
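The two feature representations named in the abstract can be illustrated with a minimal Python sketch. It is not the iRNA5hmC implementation: the function names, the default k = 2, the frequency normalization, and the A/C/G/U one-hot ordering are all assumptions made for illustration, and the two-stage feature optimization and the SVM training step are omitted.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed nucleotide ordering for both encodings


def kmer_spectrum(seq, k=2):
    """Return the frequency of every possible k-mer, in lexicographic order.

    Normalizing by the number of windows is an assumption; raw counts
    would work equally well as an illustration.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        sub = seq[i:i + k]
        if sub in counts:  # windows with non-ACGU symbols are skipped
            counts[sub] += 1
    total = max(windows, 1)  # avoid division by zero for short sequences
    return [counts[m] / total for m in kmers]


def positional_binary(seq):
    """One-hot encode each position: A=1000, C=0100, G=0010, U=0001.

    Any other symbol is encoded as 0000.
    """
    vec = []
    for base in seq:
        vec.extend(1 if base == ref else 0 for ref in ALPHABET)
    return vec


def encode(seq, k=2):
    """Concatenate both representations into one feature vector."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return kmer_spectrum(seq, k) + positional_binary(seq)
```

For a sequence of length n, `encode` returns a vector of 4^k + 4n values. Fixed-length sequence windows centered on the candidate cytosine would give vectors of equal size, which an SVM requires; the window length used by the authors is not stated in the abstract.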