Graduate Group in Genomics and Computational Biology, School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; The Wistar Institute, Philadelphia, Pennsylvania 19104, USA;
Genome Res. 2014 Jun;24(6):1039-50. doi: 10.1101/gr.166983.113. Epub 2014 Mar 27.
Mapping genome-wide data to human subtelomeres has been problematic because of their incomplete assembly and the challenges posed by low-copy repetitive DNA elements. Here, we provide updated human subtelomere sequence assemblies that were extended by filling telomere-adjacent gaps using clone-based resources. We developed and implemented a bioinformatic pipeline that incorporates multiread mapping to annotate the updated assemblies using short-read data sets. Annotation of subtelomeric sequence features, together with mapping of CTCF and cohesin binding sites using ChIP-seq data sets from multiple human cell types, confirmed that CTCF and cohesin bind within 3 kb of the start of terminal repeat tracts at many, but not all, subtelomeres. CTCF and cohesin co-occupancy was also enriched near internal telomere-like sequence (ITS) islands and the nonterminal boundaries of subtelomere repeat elements (SREs) in transformed lymphoblastoid cell lines (LCLs) and human embryonic stem cell (ES) lines, but was not significantly enriched in the primary fibroblast IMR90 cell line. Subtelomeric CTCF and cohesin sites predicted by ChIP-seq using our bioinformatics pipeline (but not predicted when only uniquely mapping reads were considered) were consistently validated by ChIP-qPCR. The colocalized CTCF and cohesin sites in SRE regions are candidates for mediating long-range chromatin interactions in the transcript-rich SRE region. We established a public browser for the integrated display of short-read sequence-based annotations relative to key subtelomere features such as the start of each terminal repeat tract, SRE identity and organization, and subtelomeric gene models.
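To illustrate why multiread mapping can recover signal that unique-only mapping misses in repetitive subtelomeric sequence, the following is a minimal sketch, not the authors' pipeline. The abstract does not say how multireads are weighted. The sketch assumes equal fractional weighting (each read contributes 1/n to each of its n candidate locations), which is one common convention. The function name `bin_coverage`, the input format, the bin size, and the toy reads are all hypothetical.

```python
from collections import defaultdict


def bin_coverage(alignments, bin_size=100, unique_only=False):
    """Accumulate read coverage per (chromosome arm, bin).

    alignments: dict mapping a read ID to a list of (chrom, position)
    candidate locations. This input format is hypothetical.
    unique_only: if True, discard every read with more than one location.
    Otherwise each read is split equally across its n locations (weight 1/n),
    an assumed weighting scheme, not necessarily the one used in the paper.
    """
    coverage = defaultdict(float)
    for hits in alignments.values():
        if not hits:
            continue  # unmapped read
        if unique_only and len(hits) > 1:
            continue  # multiread dropped under unique-only mapping
        weight = 1.0 / len(hits)
        for chrom, pos in hits:
            coverage[(chrom, pos // bin_size)] += weight
    return dict(coverage)


# Toy data: two reads fall in a segment assumed to be shared by the 1p and 9p
# subtelomeres, so each of them aligns equally well to both arms.
toy_alignments = {
    "read1": [("1p", 120)],
    "read2": [("1p", 150), ("9p", 150)],
    "read3": [("9p", 180), ("1p", 130)],
    "read4": [("9p", 420)],
}

unique_cov = bin_coverage(toy_alignments, unique_only=True)
multi_cov = bin_coverage(toy_alignments)

for key in sorted(multi_cov):
    print(key, "unique-only:", unique_cov.get(key, 0.0), "multiread:", multi_cov[key])
```

In this toy case the shared bin on 9p has no coverage at all under unique-only mapping but nonzero coverage once multireads are distributed, while the total weight still equals the number of mapped reads. A peak caller run on the unique-only track could therefore miss a site in such a region.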