Rizzato Francesca, Coucke Alice, de Leonardis Eleonora, Barton John P, Tubiana Jérôme, Monasson Rémi, Cocco Simona
Laboratoire de Physique de l'Ecole normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France.
Department of Physics and Astronomy, University of California, Riverside, 900 University Avenue, Riverside, California 92521, USA.
Phys Rev E. 2020 Jan;101(1-1):012309. doi: 10.1103/PhysRevE.101.012309.
We consider the problem of inferring a graphical Potts model on a population of variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data points, and hence requires regularization. We study here a double regularization scheme, in which the number of Potts states (colors) available to each variable is reduced and the interaction network is made sparse. To achieve the color compression, only Potts states whose empirical frequency exceeds some threshold are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach with two inference algorithms, adaptive cluster expansion (ACE) and pseudolikelihood maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multiple sequence alignments of protein families, with similar results.
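The color compression step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `compress_colors`, the integer encoding of states as 0..q-1, the strict comparison against the threshold, and the choice to place the grouped state last are all assumptions made for illustration.

```python
import numpy as np


def compress_colors(samples, q, threshold):
    """Illustrative color compression (hypothetical helper, not from the paper).

    samples   : (B, N) integer array, each entry a Potts state in 0..q-1
    q         : number of Potts states before compression
    threshold : states with empirical frequency > threshold are kept explicitly

    Returns the recoded samples and, for each site, the original states kept.
    """
    n_samples, n_sites = samples.shape
    compressed = np.empty_like(samples)
    kept_states = []
    for i in range(n_sites):
        # Empirical frequency of each state on site i
        freq = np.bincount(samples[:, i], minlength=q) / n_samples
        kept = np.flatnonzero(freq > threshold)
        # Kept states are relabeled 0..len(kept)-1; every other state
        # is mapped to one grouped state with index len(kept)
        mapping = np.full(q, len(kept), dtype=samples.dtype)
        mapping[kept] = np.arange(len(kept))
        compressed[:, i] = mapping[samples[:, i]]
        kept_states.append(kept)
    return compressed, kept_states
```

After this recoding, site i carries at most `len(kept_states[i]) + 1` states instead of q, which is where the reduction in the number of parameters comes from in this sketch.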