Immunology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America.
Systems Biology and Physiology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America.
PLoS Comput Biol. 2023 Jan 31;19(1):e1010863. doi: 10.1371/journal.pcbi.1010863. eCollection 2023 Jan.
Transcription factors (TFs) read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods for TFBS prediction from DNase-seq. However, the more recent Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq has surpassed DNase-seq as the most widely used chromatin accessibility profiling method. Furthermore, ATAC-seq is the only such technique available at single-cell resolution from standard commercial platforms. While ATAC-seq datasets grow exponentially, suboptimal motif scanning is unfortunately the most common method for TFBS prediction from ATAC-seq. To enable community access to state-of-the-art TFBS prediction from ATAC-seq, we (1) curated an extensive benchmark dataset (127 TFs) for ATAC-seq model training and (2) built "maxATAC", a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the largest collection of high-performance TFBS prediction models for ATAC-seq. maxATAC performance extends to primary cells and single-cell ATAC-seq, enabling improved TFBS prediction in vivo. We demonstrate maxATAC's capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci.