Prediction of m5C Modifications in RNA Sequences by Combining Multiple Sequence Features

Authors

Dou Lijun, Li Xiaoling, Ding Hui, Xu Lei, Xiang Huaikun

Affiliations

School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China.

Publication information

Mol Ther Nucleic Acids. 2020 Sep 4;21:332-342. doi: 10.1016/j.omtn.2020.06.004. Epub 2020 Jun 10.

DOI: 10.1016/j.omtn.2020.06.004

PMID: 32645685

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7340967/

Abstract

5-Methylcytosine (m5C) is a well-known post-transcriptional modification that plays significant roles in biological processes such as RNA metabolism, tRNA recognition, and stress responses. Traditional high-throughput techniques for identifying m5C sites are usually time-consuming and expensive. In addition, the number of known RNA sequences has grown explosively in the post-genomic era. Machine-learning-based methods are therefore urgently needed to predict RNA m5C modifications quickly and accurately. Here, we propose a novel support-vector-machine (SVM)-based tool, iRNA-m5C_SVM, which combines multiple sequence features to identify m5C sites in Arabidopsis thaliana. Eight popular feature-extraction methods were first investigated systematically. Four well-performing feature types were then combined to construct a comprehensive model: position-specific propensity (PSP) features (PSNP, PSDP, and PSTP, based on the frequencies of nucleotides, dinucleotides, and trinucleotides, respectively); nucleotide composition (nucleic acid, dinucleotide, and trinucleotide compositions; NAC, DNC, and TNC, respectively); electron-ion interaction pseudopotentials of trinucleotides (PseEIIPs); and general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-general). The model achieved accuracies of 73.06% under 10-fold cross-validation and 80.15% on independent tests, the best predictive performance in A. thaliana among existing models. We believe the proposed model can be a promising alternative for further research on m5C modification sites in plants.
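To make the nucleotide composition features concrete, the following is a minimal sketch of how NAC, DNC, and TNC vectors could be computed as normalized k-mer frequencies for k = 1, 2, and 3. It is an illustration under stated assumptions, not the authors' implementation: the function names `kmer_composition` and `composition_features`, the A/C/G/U ordering, the handling of ambiguous bases, and the example sequence are all hypothetical choices made here.

```python
from itertools import product
from typing import List

ALPHABET = "ACGU"  # assumption: RNA alphabet, features ordered lexicographically


def kmer_composition(seq: str, k: int) -> List[float]:
    """Return normalized frequencies of all 4**k k-mers found in seq."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows of width k
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # assumption: windows with ambiguous bases are skipped
            counts[window] += 1
    return [counts[m] / total for m in kmers]


def composition_features(seq: str) -> List[float]:
    """Concatenate NAC (k=1), DNC (k=2), and TNC (k=3) into one vector."""
    features: List[float] = []
    for k in (1, 2, 3):
        features.extend(kmer_composition(seq, k))
    return features


if __name__ == "__main__":
    example = "AUGCCGUAC"  # hypothetical sequence, not taken from the paper's dataset
    vec = composition_features(example)
    print(len(vec))  # 4 + 16 + 64 = 84 features
    print(vec[:4])   # NAC part: frequencies of A, C, G, U
```

A vector like this could then be concatenated with the other feature types and passed to an SVM classifier; the choice of kernel and hyperparameters is not specified in the abstract and is left open here.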

