Quach Bryan, Furey Terrence S
Curriculum in Bioinformatics and Computational Biology.
Department of Genetics.
Bioinformatics. 2017 Apr 1;33(7):956-963. doi: 10.1093/bioinformatics/btw740.
Identifying the locations of transcription factor binding sites is critical for understanding how gene transcription is regulated across different cell types and conditions. Chromatin accessibility experiments such as DNaseI sequencing (DNase-seq) and Assay for Transposase Accessible Chromatin sequencing (ATAC-seq) produce genome-wide data that include distinct 'footprint' patterns at binding sites. Nearly all existing computational methods to detect footprints from these data assume that footprint signals are highly homogeneous across footprint sites. Additionally, no comprehensive, systematic comparison has been performed of how well footprinting methods identify which motif sites for a given factor are bound.
Using DNase-seq data from the ENCODE project, we show that a large degree of previously uncharacterized site-to-site variability exists in footprint signal across motif sites for a transcription factor. To model this heterogeneity in the data, we introduce a novel, supervised learning footprinter called Detecting Footprints Containing Motifs (DeFCoM). We compare DeFCoM to nine existing methods using evaluation sets from four human cell lines and eighteen transcription factors and show that DeFCoM outperforms current methods in distinguishing bound from unbound motif sites. We also analyze the impact of several biological and technical factors on the quality of footprint predictions to highlight important considerations when conducting footprint analyses and assessing the performance of footprint prediction methods. Finally, we show that DeFCoM can detect footprints using ATAC-seq data with accuracy similar to that obtained with DNase-seq data.
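As an illustration of what scoring a footprinter on bound versus unbound motif sites can look like, the following is a minimal Python sketch. The abstract does not state which metric was used, so the area under the ROC curve is an assumption here, and the function name `roc_auc`, the variable names, and the six example scores are hypothetical, made-up values rather than anything taken from DeFCoM or its results.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    scores: per-site footprint scores, higher meaning "more likely bound".
    labels: 1 for bound motif sites, 0 for unbound motif sites.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one bound and one unbound site")
    # Count bound/unbound pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six motif sites (illustrative numbers only).
site_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
site_labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(site_scores, site_labels))
```

A value of 1.0 means every bound site outscores every unbound site, and 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of sites, which is acceptable for a sketch but would be replaced by a sort-based computation on genome-scale evaluation sets.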
Python code available at https://bitbucket.org/bryancquach/defcom.
bquach@email.unc.edu or tsfurey@email.unc.edu.
Supplementary data are available at Bioinformatics online.