Liu Feng, Li Hao, Ren Chao, Bo Xiaochen, Shu Wenjie
Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
Sci Rep. 2016 Jun 22;6:28517. doi: 10.1038/srep28517.
Transcriptional enhancers are non-coding segments of DNA that play a central role in the spatiotemporal regulation of gene expression programs. However, systematically and precisely predicting enhancers remains a major challenge. Although existing methods have achieved some success in enhancer prediction, they still suffer from many issues. We developed a deep learning-based algorithmic framework named PEDLA (https://github.com/wenjiegroup/PEDLA), which can directly learn an enhancer predictor from massively heterogeneous data and generalize in ways that are mostly consistent across various cell types/tissues. We first trained PEDLA with 1,114-dimensional heterogeneous features in H1 cells, and demonstrated that the PEDLA framework integrates diverse heterogeneous features and gives state-of-the-art performance relative to five existing methods for enhancer prediction. We further extended PEDLA to iteratively learn from 22 training cell types/tissues. Our results showed that PEDLA performed with superior consistency on both the training and independent test sets. On average, PEDLA achieved 95.0% accuracy and a 96.8% geometric mean (GM) of sensitivity and specificity across 22 training cell types/tissues, as well as 95.7% accuracy and a 96.8% GM across 20 independent test cell types/tissues. Together, our work illustrates the power of harnessing state-of-the-art deep learning techniques to consistently identify regulatory elements at a genome-wide scale from massively heterogeneous data across diverse cell types/tissues.
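The two reported metrics, accuracy and the geometric mean (GM) of sensitivity and specificity, can be illustrated with a minimal sketch. It assumes binary labels (1 = enhancer, 0 = non-enhancer) and the standard definition GM = sqrt(sensitivity × specificity). The function names and the toy labels are hypothetical and are not taken from the PEDLA repository.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = enhancer, 0 = non-enhancer)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and their geometric mean (GM)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    # Guard against empty classes so the sketch never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    gm = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gm": gm,
    }


# Toy example (made-up labels, for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Unlike accuracy, the GM falls sharply when either class is predicted poorly, which makes it a useful companion metric when enhancers are far rarer than non-enhancer regions.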