Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, China University of Mining and Technology, Xuzhou, 221116, China.
School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China.
BMC Bioinformatics. 2021 May 29;22(1):288. doi: 10.1186/s12859-021-04206-4.
As a common and abundant RNA methylation modification, N6-methyladenosine (m6A) is widespread in the transcriptomes of many species and is closely related to the occurrence and development of various life processes and diseases. Accurate identification of m6A methylation sites has therefore become a hot topic. Most biological methods rely on high-throughput sequencing technology, which places great demands on sequencing library preparation and data analysis. Various machine learning methods have therefore been proposed that extract several types of sequence-based features and then apply conventional classifiers, such as SVM and RF, to identify m6A methylation sites. However, identification performance relies heavily on the extracted features, which still need to be improved.
This paper studies feature extraction and classification of m6A methylation sites from a natural language processing perspective, integrating feature extraction and classification into a single model while taking the upstream and downstream information of m6A sites into account. One-hot encoding, RNA word embedding, and Word2vec were adopted to describe sites from the perspectives of the base itself and of its upstream and downstream sequence. A BiLSTM model, a well-known sequence model, was then constructed to discriminate sequences with potential m6A sites. Since the three feature extraction methods focus on different aspects of m6A sites, an ensemble deep learning predictor (EDLm6APred) was finally constructed for m6A site prediction. Experimental results on human and mouse data sets show that EDLm6APred outperforms each of the single-encoding models, indicating that base, upstream, and downstream information are all essential for m6A site detection. Compared with existing m6A methylation site prediction models without genomic features, EDLm6APred obtains an area under the receiver operating characteristic curve of 86.6% on the human data sets, indicating the effectiveness of sequential modeling on RNA. To maximize user convenience, a web server was developed as an implementation of EDLm6APred and made publicly available at www.xjtlu.edu.cn/biologicalsciences/EDLm6APred.
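The encodings and the ensemble step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the sequence window length, the k-mer size used for the RNA "words", or how the three BiLSTM outputs are combined, so the 3-mer tokenization with stride 1 and the unweighted averaging below are assumptions, and the function names `one_hot`, `kmer_tokens`, and `ensemble_average` are hypothetical. The BiLSTM models themselves are omitted; the probabilities passed to `ensemble_average` stand in for their outputs.

```python
from typing import List, Sequence

BASES = "ACGU"  # RNA alphabet; DNA-style "T" is mapped to "U" below


def one_hot(seq: str) -> List[List[int]]:
    """Encode each base as a 4-dimensional indicator vector.

    Bases outside ACGU (for example "N") become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in seq.upper().replace("T", "U"):
        row = [0] * len(BASES)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers ("RNA words") with stride 1.

    The resulting tokens are what an embedding layer or a Word2vec model
    would consume; k = 3 is an assumption, not a value from the abstract.
    """
    s = seq.upper().replace("T", "U")
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def ensemble_average(probabilities: Sequence[float]) -> float:
    """Combine per-model site probabilities by an unweighted mean.

    Averaging is an assumed combination rule, used here for illustration.
    """
    if not probabilities:
        raise ValueError("need at least one model probability")
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    window = "GGACU"  # toy window centred on a candidate adenosine
    print(one_hot(window))
    print(kmer_tokens(window))
    # Placeholder probabilities standing in for the three BiLSTM outputs.
    print(ensemble_average([0.9, 0.7, 0.8]))
```

In this sketch the one-hot rows describe each base on its own, while the overlapping k-mers carry local upstream and downstream context, which is the distinction the abstract draws between the encodings.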
Our proposed EDLm6APred method is a reliable predictor of m6A methylation sites.