TIMC, Université Grenoble Alpes, CNRS, Grenoble INP, Grenoble 38000, France.
Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.
Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac838.
We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other methods take a maximum likelihood approach, modeling the contact count between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data equals the mean, a relationship that is often too restrictive to model count data properly.
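To make the Poisson formulation concrete, the following is a minimal sketch of the negative log-likelihood that such a method would minimize over candidate 3D coordinates. The power-law intensity `beta * d**alpha`, the default values of `alpha` and `beta`, and the function name `poisson_nll` are illustrative assumptions; the abstract only states that the intensity decreases with distance.

```python
import numpy as np

def poisson_nll(coords, counts, alpha=-3.0, beta=1.0):
    """Poisson negative log-likelihood of a contact map given 3D coordinates.

    coords : (n, 3) array of bead positions.
    counts : (n, n) symmetric array of contact counts.
    alpha, beta : ASSUMED power-law parameters; the intensity is
        beta * distance**alpha, which decreases with distance when alpha < 0.
    The log(c!) term is dropped because it does not depend on coords.
    """
    n = coords.shape[0]
    i, j = np.triu_indices(n, k=1)           # each pair of loci once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    lam = beta * dist ** alpha               # expected count per pair
    c = counts[i, j]
    return float(np.sum(lam - c * np.log(lam)))

# Toy example: three beads and a contact map equal to the model intensities.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
diff = coords[:, None, :] - coords[None, :, :]
d = np.linalg.norm(diff, axis=-1)
counts = np.zeros((3, 3))
mask = ~np.eye(3, dtype=bool)
counts[mask] = d[mask] ** -3.0

print(poisson_nll(coords, counts))           # structure that generated counts
print(poisson_nll(1.5 * coords, counts))     # rescaled structure fits worse
```

In an actual inference the coordinates would be optimized numerically; here the function is only evaluated at two candidate structures to show that the generating structure scores better.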
We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, in which we replace the Poisson model of contact counts with a negative binomial one, parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently of the mean, and thus models overdispersed data better. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical. We show that negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions.
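The effect of a separate dispersion parameter can be illustrated with a short simulation. This is a sketch under an assumed parametrization, not necessarily the one used in Pastis-NB: the negative binomial has mean `mu` and dispersion `r`, with variance `mu + mu**2 / r`, so the Poisson case is recovered as `r` grows large. The helper names `nb_variance` and `sample_nb` are hypothetical.

```python
import numpy as np

def nb_variance(mu, r):
    """Variance of a negative binomial with mean mu and dispersion r
    (ASSUMED parametrization: var = mu + mu**2 / r)."""
    return mu + mu ** 2 / r

def sample_nb(mu, r, size, rng):
    """Draw negative binomial samples with mean mu and dispersion r.

    NumPy parametrizes the distribution by (n, p); setting n = r and
    p = r / (r + mu) gives mean mu and variance mu + mu**2 / r.
    """
    p = r / (r + mu)
    return rng.negative_binomial(r, p, size=size)

rng = np.random.default_rng(0)
mu, r = 20.0, 5.0

pois = rng.poisson(mu, size=200_000)
nb = sample_nb(mu, r, 200_000, rng)

# Poisson: sample variance is close to the mean.
print("Poisson mean/var:", pois.mean(), pois.var())
# Negative binomial: same mean, variance inflated by mu**2 / r.
print("NB mean/var:     ", nb.mean(), nb.var())
print("NB theoretical variance:", nb_variance(mu, r))
```

Both samples share the same mean, but only the negative binomial can match a variance that exceeds it, which is the situation the abstract describes for Hi-C contact counts.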
A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license.
Supplementary data are available at Bioinformatics online.