Liu Yongqi, Lin Shili
Department of Statistics, The Ohio State University, Columbus, OH 43210.
bioRxiv. 2025 Jun 17:2025.06.11.659235. doi: 10.1101/2025.06.11.659235.
Hi-C and single-cell Hi-C (scHi-C) data are now routinely generated to study a range of biological questions, including whole-genome chromatin organization, with the goal of better understanding the hierarchical three-dimensional structure of chromosomes: compartments, Topologically Associated Domains (TADs), and long-range interactions. Because of concerns about data quality, especially for scHi-C with its sparsity, data quality improvement is seen as a necessary step before performing analyses to answer biological questions. Methods have been developed accordingly, among them a set of "random walk"-based methods that includes random walk with a limited number of steps (RWS) and random walk with restart (RWR). Nevertheless, there is little justification for the use of such methods, and little quantification of how well they perform. Taking correct identification of TADs as the end point, in this paper we describe the characteristics of random-walk-based approaches and empirically compare TAD identification before and after random walks. Because practical guidelines for choosing the required tuning parameters are lacking, it is difficult to know how many steps to take for RWS, or how small a restart probability to use for RWR, to achieve good performance. Even in the unrealistic scenario in which one has the hindsight to use the optimal parameter values, we observed little improvement in downstream studies from first performing a random walk. This conclusion is based on extensive analytical study, simulation, and real-data applications. The current study therefore provides a cautionary note to researchers who may consider using random-walk-based approaches prior to downstream analyses.
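To make the two smoothing schemes concrete, here is a minimal NumPy sketch of RWS and RWR applied to a contact matrix. It assumes a common formulation that the abstract does not spell out: the contact matrix is row-normalized into a transition matrix P, RWS returns the n-step matrix P^n, and RWR iterates Q ← (1 − r)·Q·P + r·I until convergence. The function names, the row normalization, and the convergence tolerance are all illustrative choices and may differ from what the paper actually uses.

```python
import numpy as np


def row_normalize(contact):
    """Convert a contact matrix into a row-stochastic transition matrix (assumed normalization)."""
    contact = np.asarray(contact, dtype=float)
    row_sums = contact.sum(axis=1, keepdims=True)
    # Rows with no contacts stay all-zero instead of triggering a division by zero.
    return np.divide(contact, row_sums, out=np.zeros_like(contact), where=row_sums > 0)


def random_walk_steps(contact, n_steps):
    """RWS sketch: the n-step transition matrix P^n. n_steps is the tuning parameter."""
    P = row_normalize(contact)
    return np.linalg.matrix_power(P, n_steps)


def random_walk_restart(contact, restart_prob, tol=1e-10, max_iter=1000):
    """RWR sketch: iterate Q <- (1 - r) Q P + r I. restart_prob (r) is the tuning parameter."""
    P = row_normalize(contact)
    identity = np.eye(P.shape[0])
    Q = identity.copy()
    for _ in range(max_iter):
        Q_next = (1 - restart_prob) * (Q @ P) + restart_prob * identity
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
    return Q  # best estimate if max_iter was reached before convergence
```

Under these assumptions, a larger `n_steps` or a smaller `restart_prob` spreads each bin's contacts further across the matrix. That is why the choice of these two parameters matters for whether TAD boundaries are sharpened or blurred.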