Gong Haiyan, Yang Yi, Zhang Xiaotong, Li Minghong, Zhang Sichen, Chen Yang
School of Computer and Communication Engineering, Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, Beijing 100083, China.
Shunde Graduate School, University of Science and Technology Beijing, Foshan 528399, Guangdong, China.
Comput Struct Biotechnol J. 2022 Sep 5;20:4816-4824. doi: 10.1016/j.csbj.2022.08.059. eCollection 2022.
With the development of Hi-C technology, the detection of topologically associating domain (TAD) boundaries has come to play an important role in exploring the relationship between gene structure and expression. However, a method that can accurately identify TAD boundaries from Hi-C contact matrices at different resolutions is currently lacking. We propose a method named CASPIAN that identifies chromatin TAD boundaries using a spatial density clustering algorithm. CASPIAN requires few parameters to call TADs. The method is implemented with the hierarchical density-based clustering algorithm HDBSCAN, where the distance between each pair of bins is calculated with one of three distance metrics (Euclidean, Manhattan, or Chebyshev) to adapt to the characteristics of Hi-C contact matrices generated from simulation experiments or by normalization methods. Our results show that, like standard methods (e.g., Insulation Score, TopDom), the TAD boundaries called by CASPIAN are enriched for factors associated with promoting gene expression, such as CTCF, H3K4me1, H3K4me3, RAD21, POLR2A, and SMC3. We also calculated the approximate proportion of each factor anchored at the TAD boundaries to observe how these factors are distributed around the boundaries. In conclusion, CASPIAN is an easy-to-use method for exploring the relationship between transcription factors and TAD boundaries. CASPIAN is available online (https://gitee.com/ghaiyan/caspian).
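The pairwise bin distances mentioned above can be illustrated with a minimal sketch. It is not taken from the CASPIAN code base: it assumes each bin is represented by its row of the contact matrix, and the names `pairwise_bin_distances`, `METRICS`, and `toy_contacts` are hypothetical, introduced only for illustration.

```python
from math import sqrt


def euclidean(u, v):
    """Square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    """Largest absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(u, v))


# The three metrics named in the abstract.
METRICS = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
}


def pairwise_bin_distances(contact, metric="euclidean"):
    """Return an n x n distance matrix between bins.

    Assumption: each bin is described by its row of contact
    frequencies; the paper may use a different representation.
    """
    dist = METRICS[metric]
    n = len(contact)
    return [[dist(contact[i], contact[j]) for j in range(n)]
            for i in range(n)]


# Toy symmetric 4-bin contact matrix with two dense blocks
# (bins 0-1 and bins 2-3), standing in for two TADs.
toy_contacts = [
    [10, 8, 1, 0],
    [8, 10, 1, 1],
    [1, 1, 10, 9],
    [0, 1, 9, 10],
]

if __name__ == "__main__":
    for name in METRICS:
        d = pairwise_bin_distances(toy_contacts, name)
        # Bins in the same block are closer than bins in different blocks.
        print(name, "within block:", d[0][1], "across blocks:", d[0][2])
```

A distance matrix of this kind could then be handed to a density-based clusterer that accepts precomputed distances, with changes of cluster label along the chromosome read as candidate boundaries. Whether CASPIAN proceeds exactly this way is not stated in the abstract.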