Departments of Chemical Engineering and Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard, Cambridge, MA 02139, USA.
Laboratoire de Physique Statistique de L'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure & Université P.&M. Curie, Paris, France; Computational and Quantitative Biology, UPMC, UMR 7238, Sorbonne Université, Paris, France.
Bioinformatics. 2016 Oct 15;32(20):3089-3097. doi: 10.1093/bioinformatics/btw328. Epub 2016 Jun 21.
Graphical models are often employed to interpret patterns of correlations observed in data through a network of interactions between the variables. Recently, Ising/Potts models, also known as Markov random fields, have been productively applied to diverse problems in biology, including the prediction of structural contacts from protein sequence data and the description of neural activity patterns. However, inference of such models is a challenging computational problem that cannot be solved exactly. Here, we describe the adaptive cluster expansion (ACE) method to quickly and accurately infer Ising or Potts models based on correlation data. ACE avoids overfitting by constructing a sparse network of interactions sufficient to reproduce the observed correlation data within the statistical error expected due to finite sampling. When convergence of the ACE algorithm is slow, we combine it with a Boltzmann Machine Learning algorithm (BML). We illustrate this method on a variety of biological and artificial datasets and compare it to state-of-the-art approximate methods such as Gaussian and pseudo-likelihood inference.
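To make the inference problem concrete, the following is a minimal sketch of plain Boltzmann machine learning for a very small Ising model. It is not the ACE algorithm and not the authors' code: it assumes 0/1 variables, computes model averages by exact enumeration over all 2^N states (feasible only for small N), and the function names `all_states`, `model_moments` and `boltzmann_learning` are hypothetical.

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n binary (0/1) configurations as an array of shape (2^n, n)."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def model_moments(h, J, states):
    """Exact means <s_i> and pair correlations <s_i s_j> for fields h and couplings J."""
    # Energy E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j (J symmetric, zero diagonal).
    energy = -(states @ h) - 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    weights = np.exp(-(energy - energy.min()))  # shift by the minimum for numerical stability
    probs = weights / weights.sum()
    means = probs @ states
    corr = states.T @ (states * probs[:, None])
    return means, corr

def boltzmann_learning(target_means, target_corr, n_steps=5000, lr=0.5):
    """Gradient ascent on the log-likelihood: move h and J until model moments match the targets."""
    n = len(target_means)
    states = all_states(n)
    h = np.zeros(n)
    J = np.zeros((n, n))
    for _ in range(n_steps):
        means, corr = model_moments(h, J, states)
        h += lr * (target_means - means)   # gradient with respect to the fields
        dJ = lr * (target_corr - corr)     # gradient with respect to the couplings
        np.fill_diagonal(dJ, 0.0)          # no self-couplings
        J += dJ
    return h, J
```

Because the enumeration cost grows as 2^N, this brute-force approach is only an illustration of the moment-matching condition; it does not address the sparsity or the statistical-error stopping criterion described above.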
We show that ACE accurately reproduces the true parameters of the underlying model when they are known, and yields accurate statistical descriptions of both biological and artificial data. Models inferred by ACE more accurately describe the statistics of the data, including both the constrained low-order correlations and unconstrained higher-order correlations, compared to those obtained by faster Gaussian and pseudo-likelihood methods. These alternative approaches can recover the structure of the interaction network but typically not the correct strength of interactions, resulting in less accurate generative models.
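For comparison, a Gaussian (mean-field style) estimate can be sketched in a few lines: the couplings are read off the inverse of the covariance matrix, J_ij = -(C^{-1})_ij for i ≠ j. This is a generic textbook form of the approximation, assumed here for illustration; it is not necessarily the exact variant benchmarked in the paper, and the function names are hypothetical.

```python
import numpy as np

def covariance_from_samples(samples):
    """Connected correlations C_ij = <s_i s_j> - <s_i><s_j> from a (B, N) array of 0/1 samples."""
    samples = np.asarray(samples, dtype=float)
    means = samples.mean(axis=0)
    return samples.T @ samples / samples.shape[0] - np.outer(means, means)

def gaussian_couplings(cov):
    """Gaussian / mean-field coupling estimate: J_ij = -(C^{-1})_ij for i != j."""
    J = -np.linalg.inv(cov)
    np.fill_diagonal(J, 0.0)  # diagonal entries are not pairwise couplings
    return J
```

In practice the sample covariance matrix can be singular or poorly conditioned, so some form of regularization is usually needed before inversion; that step is omitted from this sketch.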
The ACE source code, user manual and tutorials with the example data and filtered correlations described herein are freely available on GitHub at https://github.com/johnbarton/ACE
Contacts: jpbarton@mit.edu, cocco@lps.ens.fr
Supplementary information: Supplementary data are available at Bioinformatics online.