Zhang Shengli, Yu Qianhao, He Haoran, Zhu Fu, Wu Panjing, Gu Lingzhi, Jiang Sijie
School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China.
School of Artificial Intelligence, Xidian University, Xi'an 710071, PR China.
Genomics. 2020 Mar;112(2):1282-1289. doi: 10.1016/j.ygeno.2019.07.017. Epub 2019 Aug 1.
DNase I hypersensitive sites (DHSs) are associated with DNA regulatory elements, so understanding them is of great significance for biomedical research. However, traditional experimental methods are poorly suited to identifying such sites in the large number of DNA sequences now emerging from sequencing. Several machine learning methods have been proposed to identify DHSs, but most ignore the spatial autocorrelation of the DNA sequence. In this paper, we propose a predictor called iDHS-DSAMS to identify DHSs based on benchmark datasets. We develop a feature extraction method called dinucleotide-based spatial autocorrelation (DSA). We then use Min-Redundancy-Max-Relevance (mRMR) to remove irrelevant and redundant features, selecting a 100-dimensional feature vector. Finally, we use an ensemble bagged tree as the classifier, trained on datasets oversampled with SMOTE. Five-fold cross-validation tests on two benchmark datasets indicate that the proposed method outperforms its existing counterparts in accuracy (Acc), Matthews correlation coefficient (MCC), sensitivity (Sn) and specificity (Sp).
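The abstract names dinucleotide-based spatial autocorrelation (DSA) but does not give its formula. As a rough illustration of the general idea only, the sketch below computes a normalized Moreau-Broto-style autocorrelation over overlapping dinucleotides: for each lag, it averages the product of a property value at dinucleotides that many positions apart. The choice of this particular autocorrelation, the function names, and the toy property (G/C count per dinucleotide) are all assumptions made for illustration and are not taken from the paper.

```python
def dinucleotides(seq):
    """Return the overlapping dinucleotides of a DNA sequence."""
    seq = seq.upper()
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def gc_count(dinuc):
    """Toy property (an assumption, not from the paper): number of G/C bases."""
    return sum(base in "GC" for base in dinuc)


def lagged_autocorrelation(seq, lag, prop=gc_count):
    """Mean product of property values for dinucleotides `lag` positions apart."""
    values = [prop(d) for d in dinucleotides(seq)]
    n_pairs = len(values) - lag
    if lag < 1 or n_pairs < 1:
        raise ValueError("lag must be >= 1 and smaller than the number of dinucleotides")
    return sum(values[i] * values[i + lag] for i in range(n_pairs)) / n_pairs


def autocorrelation_features(seq, max_lag, prop=gc_count):
    """Feature vector with one autocorrelation value per lag in 1..max_lag."""
    return [lagged_autocorrelation(seq, lag, prop) for lag in range(1, max_lag + 1)]


if __name__ == "__main__":
    print(autocorrelation_features("ACGTGGCCAT", max_lag=3))
```

In a pipeline like the one described, vectors of this kind (computed over real physicochemical properties rather than the toy one here) would be the input to feature selection, oversampling, and classification; those later steps are not shown.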