Su Wen-Xia, Li Qian-Zhong, Zhang Lu-Qiang, Fan Guo-Liang, Wu Cheng-Yan, Yan Zhen-He, Zuo Yong-Chun
Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
Gene. 2016 Oct 30;592(1):227-234. doi: 10.1016/j.gene.2016.07.059. Epub 2016 Jul 25.
Epigenetic factors are known to correlate with gene expression. However, quantitative models that accurately classify highly and lowly expressed genes based on epigenetic factors are currently lacking. In this study, we develop a new machine learning method that combines histone modifications, DNA methylation, DNA accessibility, transcription factors, and trinucleotide composition with support vector machines (SVM), in the context of the human embryonic stem cell line H1. The results indicate that predictive accuracy improves markedly when the epigenetic features are considered. The best model achieves a predictive accuracy of 95.96% and a Matthews correlation coefficient of 0.92 in the 10-fold cross-validation test, and 95.58% and 0.92, respectively, in the independent dataset test. When expression data for a gene are lacking, our model provides a way to judge whether the gene is highly or lowly expressed from genetic and epigenetic data. A web server, GECES, implementing our analysis method is available at http://202.207.14.87:8032/fuwu/GECES/index.asp, so that other scientists can obtain their desired results without working through the mathematical details.
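For readers unfamiliar with the two reported metrics, the following minimal sketch shows how accuracy and the Matthews correlation coefficient (MCC) are conventionally computed from a binary confusion matrix, treating "highly expressed" as the positive class. The function names and the counts (480, 480, 20, 20) are hypothetical, chosen only for illustration. They are not the paper's data, and this is not the authors' implementation.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of genes assigned to the correct class."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 by convention when the denominator is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (NOT taken from the paper):
# tp = highly expressed genes predicted as high, tn = lowly expressed predicted as low,
# fp = lowly expressed predicted as high,       fn = highly expressed predicted as low.
tp, tn, fp, fn = 480, 480, 20, 20

acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy = {acc:.2%}, MCC = {mcc:.2f}")  # accuracy = 96.00%, MCC = 0.92
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the two classes differ in size. This is presumably why the abstract reports both values.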