Sequence-based multiscale modeling for high-throughput chromosome conformation capture (Hi-C) data analysis.

Author information

Xia Kelin

Affiliations

Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore.

School of Biological Sciences, Nanyang Technological University, Singapore 637371, Singapore.

Publication information

PLoS One. 2018 Feb 6;13(2):e0191899. doi: 10.1371/journal.pone.0191899. eCollection 2018.

Abstract

In this paper, we introduce sequence-based multiscale modeling for biomolecular data analysis. We employ a spectral clustering method in our modeling and reveal the difference between sequence-based global scale clustering and local scale clustering. Essentially, two types of distance can be used in data clustering: Euclidean (or spatial) distance and genomic (or sequential) distance. Clusters from sequence-based global scale models optimize spatial distances, meaning that spatially adjacent loci are more likely to be assigned to the same cluster. Sequence-based local scale models, on the other hand, yield clusters that optimize genomic distances; that is, in these models, sequentially adjacent loci tend to be clustered together. We propose two sequence-based multiscale models (SeqMMs) for the study of chromosome hierarchical structures, including genomic compartments and topologically associating domains (TADs). We find that genomic compartments are determined only by global scale information in the Hi-C data: removing all local interactions within a band region as large as 10 Mb in genomic distance has almost no influence on the final compartment results. Further, in TAD analysis, we find that when the sequential scale is small, a tiny variation of the diagonal band region in a contact map results in a large change in the predicted TAD boundaries. When the scale value exceeds a threshold, the TAD boundaries become very consistent. This threshold is closely related to TAD sizes. Comparing our results with those previously obtained using a spectral clustering model, we find that our method is more robust and reliable. Finally, we demonstrate that almost all TAD boundaries from both clustering methods are local minima of a TAD summation function.
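
As a rough illustration of the compartment finding described above, the following sketch removes a diagonal band from a toy contact map and then splits the loci in two with a spectral bipartition. It is not the paper's SeqMM implementation: the function names `remove_diagonal_band` and `spectral_bipartition`, the use of the symmetric normalized Laplacian, the 8-locus synthetic map, and the band width of 1 bin are all assumptions made for this example.

```python
import numpy as np

def remove_diagonal_band(contact, band_width):
    """Zero every entry with |i - j| <= band_width (the local interactions)."""
    n = contact.shape[0]
    idx = np.arange(n)
    near = np.abs(idx[:, None] - idx[None, :]) <= band_width
    out = contact.astype(float).copy()
    out[near] = 0.0
    return out

def spectral_bipartition(weights):
    """Two-way split from the sign of the Fiedler vector.

    Assumption for this sketch: the symmetric normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2} is used; the paper may use another variant.
    """
    degree = weights.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.where(degree > 0, degree, 1.0))
    lap = np.eye(len(degree)) - inv_sqrt[:, None] * weights * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)   # eigenvalues come back in ascending order
    fiedler = eigvecs[:, 1]            # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Synthetic toy map (not real Hi-C data): 8 loci in alternating compartments.
compartment = np.array([0, 0, 1, 1, 0, 0, 1, 1])
same = compartment[:, None] == compartment[None, :]
contact = np.where(same, 1.0, 0.1)     # strong within, weak across compartments

# Add strong contacts between sequential neighbours, mimicking the bright diagonal.
idx = np.arange(len(compartment))
contact[np.abs(idx[:, None] - idx[None, :]) <= 1] += 5.0

# Global scale model: discard the local band, then cluster what remains.
banded = remove_diagonal_band(contact, band_width=1)
labels = spectral_bipartition(banded)

# The cluster labels are only defined up to swapping 0 and 1.
recovered = np.array_equal(labels, compartment) or np.array_equal(labels, 1 - compartment)
print(labels, recovered)
```

In this toy setting the planted compartments remain recoverable after every near-diagonal entry is discarded, which mirrors, in a very simplified way, the statement that compartments depend on global scale information. On real Hi-C data the band width would be expressed in genomic distance (bins times resolution), and normalization of the contact map would matter.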

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db23/5800693/d5c2577c23da/pone.0191899.g001.jpg
