Wang Yiheng, Liu Tong, Xu Dong, Shi Huidong, Zhang Chaoyang, Mo Yin-Yuan, Wang Zheng
School of Computing, University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS 39406, USA.
Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, 201 Engineering Building West, Columbia, MO 65211, USA.
Sci Rep. 2016 Jan 22;6:19598. doi: 10.1038/srep19598.
Hypo- or hyper-methylation of the human genome is one of the epigenetic features of leukemia. However, experimental approaches have determined the methylation state of only a small portion of the human genome. We developed deep-learning software named "DeepMethyl", based on stacked denoising autoencoders (SdAs), to predict the methylation state of DNA CpG dinucleotides using features inferred from three-dimensional genome topology (based on Hi-C) and DNA sequence patterns. We used experimental data from immortalised myelogenous leukemia (K562) and healthy lymphoblastoid (GM12878) cell lines to train the learning models and assess prediction performance. We tested various SdA architectures with different hidden-layer configurations and amounts of pre-training data, and compared the performance of the deep networks with that of support vector machines (SVMs). When the methylation states of sequentially neighboring regions were used as one of the learning features, an SdA achieved a blind test accuracy of 89.7% for GM12878 and 88.6% for K562. When the methylation states of sequentially neighboring regions are unknown, the accuracies are 84.82% for GM12878 and 72.01% for K562. We also analyzed the contribution of genome topological features inferred from Hi-C. DeepMethyl can be accessed at http://dna.cs.usm.edu/deepmethyl/.
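For readers unfamiliar with stacked denoising autoencoders, the following is a minimal NumPy sketch of the general technique: each layer is trained to reconstruct its clean input from a randomly masked copy, and the hidden codes of one layer become the input of the next. It is not the DeepMethyl implementation. The tied weights, masking noise, cross-entropy loss, layer sizes, learning rate, and the synthetic 20-dimensional binary feature vectors are all assumptions chosen for illustration, and the supervised fine-tuning step that would follow pre-training is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class DenoisingAutoencoder:
    """One layer of a stacked denoising autoencoder (tied weights, assumed)."""

    def __init__(self, n_visible, n_hidden, corruption=0.3, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = np.sqrt(6.0 / (n_visible + n_hidden))
        self.W = self.rng.uniform(-bound, bound, size=(n_visible, n_hidden))
        self.b_hidden = np.zeros(n_hidden)
        self.b_visible = np.zeros(n_visible)
        self.corruption = corruption  # fraction of inputs zeroed during training

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hidden)

    def decode(self, h):
        return sigmoid(h @ self.W.T + self.b_visible)

    def reconstruction_error(self, x):
        """Mean squared error between clean input and its reconstruction."""
        return float(np.mean((x - self.decode(self.encode(x))) ** 2))

    def train_step(self, x, lr=0.1):
        """One full-batch gradient step on the cross-entropy reconstruction loss."""
        # Masking noise: randomly zero out a fraction of the input features.
        mask = self.rng.random(x.shape) > self.corruption
        x_tilde = x * mask
        h = self.encode(x_tilde)
        z = self.decode(h)
        # The target is the clean input x, not the corrupted x_tilde.
        delta_out = (z - x) / x.shape[0]
        delta_hid = (delta_out @ self.W) * h * (1.0 - h)
        # Tied weights: W receives gradient from both encoder and decoder.
        grad_W = x_tilde.T @ delta_hid + delta_out.T @ h
        self.W -= lr * grad_W
        self.b_hidden -= lr * delta_hid.sum(axis=0)
        self.b_visible -= lr * delta_out.sum(axis=0)


def pretrain_stack(x, layer_sizes, epochs=50, lr=0.1, seed=0):
    """Greedy layer-wise pre-training: each layer trains on the codes of the previous one."""
    layers, inputs = [], x
    for i, n_hidden in enumerate(layer_sizes):
        dae = DenoisingAutoencoder(inputs.shape[1], n_hidden, seed=seed + i)
        for _ in range(epochs):
            dae.train_step(inputs, lr=lr)
        layers.append(dae)
        inputs = dae.encode(inputs)
    return layers


# Synthetic stand-in for per-CpG feature vectors (hypothetical, not real data).
demo_rng = np.random.default_rng(42)
demo_features = (demo_rng.random((100, 20)) < 0.5).astype(float)
demo_layers = pretrain_stack(demo_features, layer_sizes=[8, 4], epochs=20)
print("layers trained:", len(demo_layers))
```

In a complete classifier, the pre-trained layers would typically be topped with an output layer and fine-tuned on labelled methylation states; that part is left out here to keep the sketch short.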