Hassanzadeh Hamid Reza, Wang May D
Department of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332.
Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia 30332.
Proceedings (IEEE Int Conf Bioinformatics Biomed). 2016 Dec;2016:178-183. doi: 10.1109/bibm.2016.7822515. Epub 2017 Jan 19.
Transcription factors (TFs) are macromolecules that bind to specific cis-regulatory sub-regions of DNA promoters and initiate transcription. Finding the exact location of these binding sites (also known as motifs) is important in a variety of domains, such as drug design and development. To address this need, several in vivo and in vitro techniques have been developed to characterize and predict the binding specificity of a protein to different DNA loci. The major problem with these techniques is that they are not accurate enough in predicting binding affinity and characterizing the corresponding motifs. As a result, downstream analysis is required to uncover the locations where proteins of interest bind. Here, we propose DeeperBind, a long short-term recurrent convolutional network for predicting protein binding specificities with respect to DNA probes. DeeperBind can model the positional dynamics of probe sequences and hence effectively accounts for the contributions made by individual sub-regions of DNA sequences. Moreover, it can be trained and tested on datasets containing varying-length sequences. We apply our pipeline to datasets derived from protein binding microarrays (PBMs), an in vitro high-throughput technology for quantifying protein-DNA binding preferences, and present promising results. To the best of our knowledge, this is the most accurate pipeline for predicting the binding specificities of DNA sequences from data produced by high-throughput technologies; it uses deep learning for both feature generation and positional dynamics modeling.
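To make the convolutional front end concrete, the following minimal Python sketch one-hot encodes DNA probes and slides a single motif filter along them. It is an illustration under stated assumptions, not the DeeperBind implementation: the helper names `one_hot` and `conv_scan` and the hand-written `TGA_KERNEL` filter are hypothetical, and a trained network would learn many such filters rather than use a fixed one.

```python
# Illustrative sketch only; not the authors' code. All names are hypothetical.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T.

    Characters outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence.

    Returns one ReLU-activated score per valid offset, so the output
    length is len(encoded) - width + 1 and varies with the probe length.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(max(0.0, total))
    return scores


# Hypothetical 3-mer detector that rewards the pattern "TGA".
TGA_KERNEL = [
    [0.0, 0.0, 0.0, 1.0],  # position 1: T
    [0.0, 0.0, 1.0, 0.0],  # position 2: G
    [1.0, 0.0, 0.0, 0.0],  # position 3: A
]

if __name__ == "__main__":
    for probe in ("ACTGAC", "TGATTGACC"):
        print(probe, conv_scan(one_hot(probe), TGA_KERNEL))
```

The two probes yield score sequences of different lengths (4 and 7 positions), each retaining where along the probe the filter matched. A recurrent layer can consume such position-ordered, variable-length sequences directly, which is one plausible reading of how the model handles varying-length inputs and positional dynamics.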