Schultz Eric R, Kyhl Soren, Willett Rebecca, de Pablo Juan J
Pritzker School of Molecular Engineering, The University of Chicago, Chicago, Illinois, United States of America.
Department of Statistics and Computer Science, The University of Chicago, Chicago, Illinois, United States of America.
PLoS Comput Biol. 2025 Apr 9;21(4):e1012912. doi: 10.1371/journal.pcbi.1012912. eCollection 2025 Apr.
The physical organization of the genome in three-dimensional space regulates many biological processes, including gene expression and cell differentiation. Three-dimensional characterization of genome structure is critical to understanding these biological processes. Direct experimental measurements of genome structure are challenging; computational models of chromatin structure are therefore necessary. We develop an approach that combines a particle-based chromatin polymer model, molecular simulation, and machine learning to efficiently and accurately estimate chromatin structure from indirect measures of genome structure. More specifically, we introduce a new approach where the interaction parameters of the polymer model are extracted from experimental Hi-C data using a graph neural network (GNN). We train the GNN on simulated data from the underlying polymer model, avoiding the need for large quantities of experimental data. The resulting approach accurately estimates chromatin structures across all chromosomes and across several experimental cell lines despite being trained almost exclusively on simulated data. The proposed approach can be viewed as a general framework for combining physical modeling with machine learning, and it could be extended to integrate additional biological data modalities. Ultimately, we achieve accurate and high-throughput estimations of chromatin structure from Hi-C data, which will be necessary as experimental methodologies, such as single-cell Hi-C, improve.
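As a rough illustration of the idea summarized above (a contact map treated as a graph, passed through a message-passing network, and mapped to pairwise interaction parameters), the following is a minimal NumPy sketch. It is not the authors' architecture: the function names, the thresholding rule, the mean-aggregation layer, the dot-product readout, and the random untrained weights are all assumptions made for illustration, and the toy symmetric matrix stands in for a real Hi-C map.

```python
import numpy as np


def contact_map_to_graph(contacts, threshold=0.0):
    """Treat each genomic bin as a node and keep contacts above `threshold`
    as weighted edges (hypothetical preprocessing, not from the paper)."""
    contacts = np.asarray(contacts, dtype=float)
    adjacency = np.where(contacts > threshold, contacts, 0.0)
    np.fill_diagonal(adjacency, 0.0)  # drop self-loops
    return adjacency


def message_passing_step(adjacency, features, weight):
    """One message-passing layer: weighted mean over neighbors, added to the
    node's own features, then a linear map and ReLU."""
    degree = adjacency.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0  # avoid division by zero for isolated nodes
    aggregated = adjacency @ features / degree
    return np.maximum(0.0, (features + aggregated) @ weight)


def predict_pairwise_parameters(embeddings):
    """Symmetric pairwise readout (dot product of node embeddings), standing
    in for the polymer model's interaction parameters."""
    return embeddings @ embeddings.T


rng = np.random.default_rng(0)
n_bins = 6

# Toy symmetric "contact map"; a real Hi-C map would be loaded from data.
raw = rng.random((n_bins, n_bins))
contacts = (raw + raw.T) / 2

adjacency = contact_map_to_graph(contacts, threshold=0.3)
features = np.eye(n_bins)                  # one-hot node features (assumption)
weight = rng.normal(size=(n_bins, 4))      # untrained weights, illustration only

embeddings = message_passing_step(adjacency, features, weight)
params = predict_pairwise_parameters(embeddings)
print(params.shape)
```

In the approach the abstract describes, a network of this general kind would be trained on contact maps simulated from the polymer model with known parameters, so that the predicted parameters can be compared against ground truth during training; the sketch above shows only an untrained forward pass.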