Shen Yihang, Yu Lingge, Qiu Yutong, Zhang Tianyu, Kingsford Carl
Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA.
Department of Statistics and Data Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA.
bioRxiv. 2023 Nov 10:2023.11.08.566275. doi: 10.1101/2023.11.08.566275.
Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our understanding of 3D chromosome structures. The first step of the Hi-C analysis pipeline maps Hi-C sequencing reads to a linear reference genome. However, a linear reference genome does not incorporate genetic variation, which can lead to incorrect read alignments, especially for samples that differ substantially from the reference, such as cancer samples. Using a genome graph as the reference enables more accurate read mapping. New algorithms are then required to infer a linear genome from Hi-C reads mapped onto the genome graph and to construct the corresponding Hi-C contact matrix, which is a prerequisite for subsequent steps of Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show that it is hard to solve, and introduce a novel heuristic algorithm tailored to it. We provide a theoretical analysis to evaluate the efficacy of our algorithm. Finally, our empirical experiments indicate that the linear genomes inferred by our method yield improved Hi-C contact matrices. These matrices show fewer erroneous patterns caused by structural variations and capture the structures of topologically associated domains more accurately.
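To make the last step of that pipeline concrete, the following is a minimal sketch of how a Hi-C contact matrix could be built once a linear genome has been chosen as a path through a genome graph. It is not the inference algorithm introduced in this work; the path is simply given as input. All names (`path_offsets`, `project`, `contact_matrix`), the `(node, offset)` read representation, and the simplifying assumptions (each node appears at most once on the path, all nodes are traversed in forward orientation, read pairs touching off-path nodes are discarded) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: a mapped read end is (node id, 0-based offset in that node).
GraphPos = Tuple[str, int]


def path_offsets(path: List[str], node_len: Dict[str, int]) -> Dict[str, int]:
    """Return the linear start coordinate of each node along the chosen path."""
    starts: Dict[str, int] = {}
    pos = 0
    for node in path:
        starts[node] = pos
        pos += node_len[node]
    return starts


def project(pos: GraphPos, starts: Dict[str, int]) -> Optional[int]:
    """Project a graph position onto the linear path, or None if the node is off-path."""
    node, offset = pos
    if node not in starts:
        return None
    return starts[node] + offset


def contact_matrix(
    pairs: List[Tuple[GraphPos, GraphPos]],
    path: List[str],
    node_len: Dict[str, int],
    bin_size: int,
) -> List[List[int]]:
    """Bin projected read pairs into a symmetric contact matrix."""
    starts = path_offsets(path, node_len)
    total = sum(node_len[n] for n in path)
    n_bins = (total + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for a, b in pairs:
        x, y = project(a, starts), project(b, starts)
        if x is None or y is None:
            continue  # at least one end maps to a node not on the chosen path
        i, j = x // bin_size, y // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


if __name__ == "__main__":
    # Toy graph: a bubble with a reference allele "ref" and an alternative allele "alt".
    node_len = {"s1": 10, "ref": 5, "alt": 8, "s2": 10}
    path = ["s1", "alt", "s2"]  # the linear genome assumed for this sample
    pairs = [
        (("s1", 2), ("alt", 3)),
        (("alt", 7), ("s2", 9)),
        (("s1", 0), ("ref", 1)),  # dropped: "ref" is not on the path
        (("s2", 0), ("s2", 5)),
    ]
    for row in contact_matrix(pairs, path, node_len, bin_size=10):
        print(row)
```

In this toy setting, choosing a different path through the bubble changes both the coordinates of downstream nodes and the set of read pairs that are kept, which illustrates why the choice of linear genome affects the resulting contact matrix.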