School of Mathematics and Statistics, Xidian University, Xi'an, 710071, People's Republic of China.
Interdiscip Sci. 2021 Sep;13(3):413-425. doi: 10.1007/s12539-021-00429-4. Epub 2021 Apr 8.
DNA N6-methyladenine (6mA) is an essential component of epigenetic modification, and its role in genetic regulation cannot be neglected. Efficient and accurate prediction of 6mA sites benefits the development of biological genetics. Biochemical experimental methods are time-consuming and laborious. Most established machine learning methods are built and evaluated on a single dataset, and although some of them have achieved cross-species prediction, their results are not satisfactory. We therefore designed a novel statistical model, i6mA-VC, to improve the prediction accuracy for 6mA sites. On the one hand, kmer and binary encoding are applied to extract features, and a gradient boosting decision tree (GBDT) embedded method is then applied as the feature selection strategy. On the other hand, DNA sequences are represented as vectors through the ring-function-hydrogen-chemical properties (RFHCP) feature extraction method, with ExtraTree as the feature selection strategy. After the two optimal feature sets are fused, a voting classifier based on GBDT, light gradient boosting machine (LightGBM) and a multilayer perceptron classifier (MLPC) is constructed for final classification and prediction. Under five-fold cross-validation, the accuracies on the Rice dataset and the M. musculus dataset are 0.888 and 0.967, respectively. With the cross-species dataset used as the independent test set, the accuracy reaches 0.848. These experiments indicate that the proposed predictor is reliable and applicable. The i6mA-VC predictor offers an effective way to recognize N6-methyladenine sites and should help biological geneticists further study gene expression and DNA modification. In addition, an accessible web server for i6mA-VC is available at http://www.zhanglab.site/.
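To make the first feature branch concrete, the following is a minimal sketch of kmer frequency encoding and binary (one-hot) encoding of a DNA sequence. It is an illustration under stated assumptions, not the i6mA-VC implementation: the function names, the default `k=2`, the A/C/G/T bit order, and the handling of ambiguous bases are all hypothetical choices, since the abstract does not specify them.

```python
from itertools import product

BASES = "ACGT"

def kmer_frequencies(seq, k=2):
    """Return the normalized frequency of every length-k word over A/C/G/T.

    Words are ordered lexicographically (AA, AC, ..., TT for k=2).
    The value of k is an assumption; the abstract does not state it.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[word] += 1
    return [counts[w] / total for w in kmers]

def binary_encoding(seq):
    """One-hot encode each nucleotide as 4 bits, concatenated along the sequence.

    The bit order (A, C, G, T) is an assumed convention; unknown bases map to zeros.
    """
    one_hot = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    vec = []
    for base in seq.upper():
        vec.extend(one_hot.get(base, (0, 0, 0, 0)))
    return vec

def encode(seq, k=2):
    """Concatenate kmer frequencies and binary encoding into one feature vector."""
    return kmer_frequencies(seq, k) + binary_encoding(seq)
```

For a sequence of length n, `encode` returns 4^k + 4n values. In a pipeline like the one described, a vector of this kind would then be passed to a feature selection step before classification.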