Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
Genome Res. 2024 Jul 23;34(6):937-951. doi: 10.1101/gr.278638.123.
Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. However, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq. Most regulatory genomics analysis pipelines discard "multimapped" reads that align equally well to multiple genomic locations. Because multimapped reads arise predominantly from repeats, current analysis pipelines fail to detect a substantial portion of regulatory events that occur in repetitive regions. To address this shortcoming, we developed Allo, a new approach to allocate multimapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multimapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multimapping read assignment. Allo also provides read-level output in the form of a corrected alignment file, making it compatible with existing regulatory genomics analysis pipelines and downstream peak-finders. In a demonstration application on CTCF ChIP-seq data, we show that Allo results in the discovery of thousands of new CTCF peaks. Many of these peaks contain the expected cognate motif and/or serve as TAD boundaries. We additionally apply Allo to a diverse collection of ENCODE ChIP-seq data sets, resulting in multiple previously unidentified interactions between transcription factors and repetitive element families. Finally, we show that Allo may be particularly beneficial in identifying ChIP-seq peaks at centromeres, near segmentally duplicated genes, and in younger TEs, enabling new regulatory analyses in these regions.
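As a rough illustration of the "probabilistic mapping" idea only, the sketch below assigns a multimapped read to one of its candidate locations with probability proportional to the number of uniquely mapped reads nearby. It is not Allo's implementation: the convolutional neural network step is omitted entirely, and the function names, the 250 bp half-window, and the pseudocount are all assumptions made for this example.

```python
import random
from bisect import bisect_left, bisect_right


def window_count(sorted_positions, center, half_width):
    """Count uniquely mapped read positions within +/- half_width of center."""
    lo = bisect_left(sorted_positions, center - half_width)
    hi = bisect_right(sorted_positions, center + half_width)
    return hi - lo


def allocation_probabilities(unique_by_chrom, candidates,
                             half_width=250, pseudocount=1.0):
    """Return one probability per candidate (chrom, pos) location.

    Weights are local counts of uniquely mapped reads plus a pseudocount,
    so a location with no nearby unique reads still keeps a small chance.
    Both half_width and pseudocount are hypothetical defaults.
    """
    weights = [
        window_count(unique_by_chrom.get(chrom, []), pos, half_width) + pseudocount
        for chrom, pos in candidates
    ]
    total = sum(weights)
    return [w / total for w in weights]


def allocate_read(unique_by_chrom, candidates, rng, **kwargs):
    """Sample a single location for a multimapped read."""
    probs = allocation_probabilities(unique_by_chrom, candidates, **kwargs)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Toy data: sorted start positions of uniquely mapped reads per chromosome.
unique_reads = {"chr1": [100, 120, 130, 150, 5000], "chr2": [200]}
# A multimapped read that aligns equally well to two places.
candidate_sites = [("chr1", 125), ("chr2", 9000)]

probs = allocation_probabilities(unique_reads, candidate_sites)
chosen = allocate_read(unique_reads, candidate_sites, random.Random(0))
```

In this toy case the chr1 candidate has four unique reads within the window and the chr2 candidate has none, so the weights are 5 and 1 and the read goes to chr1 with probability 5/6. A real tool would read candidate alignments from an alignment file and write the chosen one back out, which is the read-level output the abstract describes.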