Department of Physics, University of Florida, Gainesville, Florida, United States of America.
Elanco Animal Health, Greenfield, Indiana, United States of America.
PLoS Comput Biol. 2021 Aug 6;17(8):e1009275. doi: 10.1371/journal.pcbi.1009275. eCollection 2021 Aug.
In modern computational biology, there is great interest in building probabilistic models to describe collections of a large number of co-varying binary variables. However, current approaches to building generative models rely on the modeler's identification of constraints and are computationally expensive to infer when the number of variables is large (N~100). Here, we address both of these issues with the Super-statistical Generative Model for binary Data (SiGMoiD). SiGMoiD is a maximum entropy-based framework in which we imagine the data as arising from a super-statistical system: individual binary variables in a given sample are coupled to the same 'bath', whose intensive variables vary from sample to sample. Importantly, unlike standard maximum entropy approaches, where the modeler specifies the constraints, the SiGMoiD algorithm infers them directly from the data. Due to this optimal choice of constraints, SiGMoiD allows us to model collections of a very large number (N>1000) of binary variables. Finally, SiGMoiD offers a reduced-dimensional description of the data, allowing us to identify clusters of similar data points as well as of binary variables. We illustrate the versatility of SiGMoiD using multiple datasets spanning several time and length scales.
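To make the super-statistical picture concrete, the following is a minimal sketch of the generative idea only, not the SiGMoiD inference algorithm. It assumes a logistic form for the probability that each variable is on, given per-sample bath variables `betas` and per-variable features `E`; that functional form, the Gaussian draws, and all names and sizes are illustrative assumptions that the abstract does not specify.

```python
import numpy as np

def sample_superstatistical(E, betas, rng):
    """Draw one binary vector per row of `betas`.

    E     : (N, K) hypothetical per-variable "energy-like" features (assumption).
    betas : (M, K) hypothetical per-sample bath (intensive) variables (assumption).
    Returns the (M, N) binary samples and the (M, N) on-probabilities.
    """
    logits = -betas @ E.T                      # shape (M, N)
    probs = 1.0 / (1.0 + np.exp(-logits))      # assumed logistic link
    samples = (rng.random(probs.shape) < probs).astype(np.int8)
    return samples, probs

rng = np.random.default_rng(0)
N, K, M = 50, 2, 2000                          # variables, bath dimensions, samples
E = rng.normal(size=(N, K))                    # illustrative values, not fitted
betas = rng.normal(size=(M, K))                # bath variables differ per sample

X, probs = sample_superstatistical(E, betas, rng)

# Within a sample the variables are drawn independently given that sample's
# bath; pooling samples with different baths induces correlations among them.
corr = np.corrcoef(X.T)                        # (N, N) variable-variable correlations
off_diag = corr[~np.eye(N, dtype=bool)]
mean_abs_corr = float(np.mean(np.abs(off_diag)))
```

In this sketch each sample is summarized by K numbers (its row of `betas`) and each variable by K numbers (its row of `E`), which is the sense in which such a model gives a reduced-dimensional description of both data points and variables.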