Department of Statistics & Data Science, Yale University, New Haven, CT 06520, USA.
Department of Computer Science, University of California, Irvine, CA 92617, USA.
Bioinformatics. 2021 Jul 12;37(Suppl_1):i280-i288. doi: 10.1093/bioinformatics/btab283.
Mapping distal regulatory elements, such as enhancers, is a cornerstone for elucidating how genetic variation may influence disease. Previous enhancer-prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have treated enhancer discovery as a binary classification problem without accurate boundary detection, producing low-resolution annotations that include superfluous regions and reduce the statistical power of downstream analyses (e.g. causal variant mapping and functional validation). Here, we addressed these challenges with a two-step model called Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays (DECODE). First, we used direct enhancer-activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution, we implemented a weakly supervised object detection framework that localizes enhancers with precise boundary detection (at 10 bp resolution) using Gradient-weighted Class Activation Mapping.
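The second step can be illustrated with a minimal sketch of Gradient-weighted Class Activation Mapping along a one-dimensional genomic window, followed by a simple boundary refinement. This is not the DECODE implementation. The function names `grad_cam_1d` and `refine_boundaries`, the 0.5 threshold, the rule of keeping the longest contiguous run of high-scoring bins, and the toy input arrays are all assumptions made for illustration. Only the 10 bp bin size comes from the abstract.

```python
import numpy as np


def grad_cam_1d(activations, gradients):
    """Grad-CAM over a 1D feature map.

    activations, gradients: arrays of shape (channels, positions), taken from
    the last convolutional layer for the "enhancer" class score.
    Returns a per-position importance profile scaled to [0, 1].
    """
    # Channel weights: gradient averaged over positions (global average pooling).
    weights = gradients.mean(axis=1)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(weights @ activations, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def refine_boundaries(cam, region_start, bin_size=10, threshold=0.5):
    """Return (start, end) of the longest run of bins with cam >= threshold.

    ASSUMPTION: threshold and the longest-run rule are illustrative choices,
    not values taken from the paper. Returns None if no bin passes.
    """
    above = cam >= threshold
    best = None
    run_start = None
    # A trailing False closes any run that reaches the last bin.
    for i, flag in enumerate(np.append(above, False)):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if best is None or (i - run_start) > (best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return (region_start + best[0] * bin_size, region_start + best[1] * bin_size)


# Toy example: 2 channels, 8 bins of 10 bp (an 80 bp window starting at 1000).
acts = np.array([[0, 0, 1, 4, 4, 1, 0, 0],
                 [1, 1, 1, 1, 1, 1, 1, 1]], dtype=float)
grads = np.array([[1.0] * 8,
                  [-0.5] * 8])
cam = grad_cam_1d(acts, grads)
print(refine_boundaries(cam, region_start=1000))  # (1030, 1050)
```

In this toy window the 80 bp input is condensed to a 20 bp core. In a real model the activations and gradients would come from the trained classifier rather than from hand-written arrays.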
Our DECODE binary classifier outperformed a state-of-the-art enhancer prediction method by 24% in transgenic mouse validation. Furthermore, the object detection framework can condense enhancer annotations to only 13% of their original size, and these compact annotations have significantly higher conservation scores and genome-wide association study variant enrichments than the original predictions. Overall, DECODE is an effective tool for enhancer classification and precise localization.
DECODE source code and pre-processing scripts are available at decode.gersteinlab.org.
Supplementary data are available at Bioinformatics online.