Wang Xiaofeng, Yan Renxiang
College of Mathematics and Computer Science, Shanxi Normal University, Linfen, 041004, China.
Institute of Applied Genomics, School of Biological Sciences and Engineering, Fuzhou University, Fuzhou, 350002, China.
Plant Mol Biol. 2018 Feb;96(3):327-337. doi: 10.1007/s11103-018-0698-9. Epub 2018 Jan 16.
We curated a reliable dataset of m6A sites in Arabidopsis thaliana, built competitive models for predicting m6A sites, extracted predominant rules from the prediction models, and analyzed the most important features. In biological RNA, approximately 150 chemical modifications have been discovered, of which N6-methyladenine (m6A) is the most prevalent and abundant. This modification plays an essential role in a myriad of biological mechanisms and regulates RNA localization, nuclear export, translation, stability, alternative splicing, and other processes. However, m6A-seq and other wet-lab techniques do not readily allow accurate and complete determination of m6A sites across the transcriptome. Computational methods that yield accurate models for predicting m6A sites are therefore essential. In this work, we manually curated a reliable dataset of m6A sites and non-m6A sites and developed a new tool called RFAthM6A for predicting m6A sites in Arabidopsis thaliana. Briefly, RFAthM6A consists of four independent models named RFPSNSP, RFPSDSP, RFKSNPF and RFKNF. Strict benchmarks show that the AUC values of the four models reached 0.894, 0.914, 0.920 and 0.926, respectively, in a fivefold cross-validation. The prediction performance of RFPSDSP, RFKSNPF and RFKNF exceeded that of three previously reported models (AthMethPre, M6ATH and RAM-NPPS). A linear combination of the prediction scores of RFPSDSP, RFKSNPF and RFKNF improved the prediction performance further. We also extracted several predominant rules that underlie m6A site identification from the trained models. Furthermore, the most important features of the predictors for m6A site identification were analyzed in depth. To facilitate use of our proposed models by interested researchers, all the source code and datasets are publicly deposited at https://github.com/nongdaxiaofeng/RFAthM6A .
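The two quantitative ideas in the abstract, scoring a model by AUC and linearly combining the scores of several models, can be illustrated with a minimal sketch. This is not the RFAthM6A code: the function names `auc` and `combine_scores`, the toy labels and scores, and the equal weights are all assumptions made for illustration, and the abstract does not state which weights the authors used.

```python
from typing import List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Ties count as half a win (Mann-Whitney U identity)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def combine_scores(score_lists: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-model scores, one combined score per sample."""
    if len(score_lists) != len(weights):
        raise ValueError("one weight per model is required")
    n = len(score_lists[0])
    if any(len(s) != n for s in score_lists):
        raise ValueError("all models must score the same samples")
    return [sum(w * s[i] for w, s in zip(weights, score_lists))
            for i in range(n)]


# Hypothetical toy data: 1 = m6A site, 0 = non-m6A site.
labels = [1, 1, 1, 0, 0, 0]
# Made-up scores standing in for three models' outputs on the same samples.
scores_a = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]
scores_b = [0.8, 0.6, 0.3, 0.4, 0.3, 0.2]
scores_c = [0.6, 0.7, 0.8, 0.7, 0.1, 0.3]

# Equal weights are an assumption, not the published setting.
weights = [1 / 3, 1 / 3, 1 / 3]
combined = combine_scores([scores_a, scores_b, scores_c], weights)

for name, s in [("a", scores_a), ("b", scores_b),
                ("c", scores_c), ("combined", combined)]:
    print(f"model {name}: AUC = {auc(labels, s):.3f}")
```

In a real evaluation the scores would come from held-out folds of a cross-validation, and any combination weights would be chosen on training data only, so that the reported AUC is not inflated.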