Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China.
Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, Hunan, China.
BMC Bioinformatics. 2021 Jun 2;22(Suppl 6):129. doi: 10.1186/s12859-021-04006-w.
Nucleosomes play an important role in genome expression, DNA replication, DNA repair and transcription, so nucleosome positioning has long received extensive attention. Given the diversity of DNA sequence representation methods, we integrated multiple features and analyzed their effect on nucleosome positioning prediction. This analysis can also deepen the theoretical understanding of nucleosome positioning.
We used frequency chaos game representation (FCGR) to construct DNA sequence features, integrated it with other features, and adopted principal component analysis (PCA) to reduce the feature dimension. Support vector machine (SVM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), multilayer perceptron (MLP) and convolutional neural network (CNN) models were each used as predictors of nucleosome positioning. The integrated feature vector gave significantly better prediction quality than any single feature. After PCA reduced the feature dimension, prediction quality on the H. sapiens dataset improved significantly.
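As an illustration of how an FCGR feature vector can be built from a DNA sequence, the following is a minimal sketch and not the authors' implementation. The function name `fcgr`, the order parameter `k`, the corner assignment (A, C, G, T on the corners of the unit square) and the handling of ambiguous bases are all assumptions made for this example.

```python
def fcgr(seq, k=3):
    """Return a flattened, normalized FCGR vector of length 4**k.

    Assumed corner layout (a common convention, not taken from the paper):
    A=(0,0), C=(0,1), G=(1,1), T=(1,0).
    """
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    size = 2 ** k                      # the unit square is cut into size x size cells
    counts = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5                    # chaos game starts at the centre of the square
    seen = 0                           # consecutive valid bases read so far

    for base in seq.upper():
        if base not in corners:
            seen = 0                   # assumption: ambiguous bases (e.g. N) break the k-mer
            continue
        cx, cy = corners[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        seen += 1
        if seen >= k:
            # After k steps the cell is fixed by the last k bases, i.e. one k-mer.
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            counts[row][col] += 1

    total = sum(map(sum, counts))
    return [c / total if total else 0.0 for row in counts for c in row]
```

A vector produced this way could be concatenated with other sequence features and passed to a dimension-reduction step such as PCA before classification; the choice of `k` controls the vector length (4**k) and would need to be tuned for each dataset.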
Comparative analysis and prediction on the H. sapiens, C. elegans, D. melanogaster and S. cerevisiae datasets demonstrate that applying FCGR to nucleosome positioning is feasible, and that an integrated feature representation performs better.