Selection of sequence motifs and generative Hopfield-Potts models for protein families.

Affiliation

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative-LCQB, Paris, France.

Publication information

Phys Rev E. 2019 Sep;100(3-1):032128. doi: 10.1103/PhysRevE.100.032128.

DOI:10.1103/PhysRevE.100.032128

PMID:31639992

Abstract

Statistical models for families of evolutionarily related proteins have recently gained interest. In particular, pairwise Potts models, such as those inferred by direct-coupling analysis, have been able to extract information about the three-dimensional structure of folded proteins and about the effect of amino acid substitutions in proteins. These models are typically required to reproduce the one- and two-point statistics of amino acid usage in a protein family, i.e., to capture the so-called residue conservation and covariation statistics of proteins of common evolutionary origin. Pairwise Potts models are the maximum-entropy models that achieve this. Although successful, these models depend on huge numbers of parameters introduced ad hoc, which have to be estimated from finite amounts of data and whose biophysical interpretation remains unclear. Here, we propose an approach to parameter reduction based on selecting collective sequence motifs. It leads naturally to the formulation of statistical sequence models in terms of Hopfield-Potts models. These models can be accurately inferred using a mapping to restricted Boltzmann machines and persistent contrastive divergence. We show that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models. The Hopfield patterns form interpretable sequence motifs and may be used to cluster amino acid sequences into functional subfamilies. However, the distributed, collective nature of these motifs intrinsically limits the ability of Hopfield-Potts models to predict contact maps, which shows the need to develop models that go beyond the Hopfield-Potts models discussed here.
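To make the parameter reduction concrete, the following Python sketch compares the number of parameters in a full pairwise Potts model with that of a Hopfield-Potts model built from a small number of patterns, and shows how low-rank couplings could be assembled from patterns. It is an illustration under stated assumptions, not the authors' implementation: the alphabet size q = 21 (20 amino acids plus a gap), the sequence length, the function names, and the unnormalized form J_ij(a,b) = sum over mu of xi_i^mu(a) xi_j^mu(b) are all choices made here, and the random patterns are placeholders rather than inferred values.

```python
import numpy as np


def potts_param_count(L, q=21):
    """Parameter count of a full pairwise Potts model (no gauge fixing).

    L: sequence length; q: alphabet size (q=21 is an assumption here).
    """
    fields = L * q                          # local fields h_i(a)
    couplings = L * (L - 1) // 2 * q * q    # couplings J_ij(a, b) for i < j
    return fields + couplings


def hopfield_potts_param_count(L, p, q=21):
    """Parameter count when couplings are built from p patterns xi^mu_i(a)."""
    fields = L * q
    patterns = p * L * q
    return fields + patterns


def couplings_from_patterns(xi):
    """Assemble couplings from patterns of shape (p, L, q).

    Returns J of shape (L, q, L, q) with
    J[i, a, j, b] = sum_mu xi[mu, i, a] * xi[mu, j, b] for i != j.
    The normalization is an assumption; conventions differ between papers.
    """
    p, L, q = xi.shape
    J = np.einsum("mia,mjb->iajb", xi, xi)
    for i in range(L):
        J[i, :, i, :] = 0.0  # no self-couplings within a single site
    return J


# Hypothetical sizes chosen only for illustration.
L, q, p = 50, 21, 30
rng = np.random.default_rng(0)
xi = rng.normal(scale=0.1, size=(p, L, q))  # placeholder patterns, not inferred

J = couplings_from_patterns(xi)
n_full = potts_param_count(L, q)
n_hopfield = hopfield_potts_param_count(L, p, q)
print(f"full Potts: {n_full} parameters")
print(f"Hopfield-Potts with p={p}: {n_hopfield} parameters")
```

With these illustrative sizes, the pattern-based model has more than an order of magnitude fewer parameters than the full pairwise model, because its count grows linearly rather than quadratically with sequence length.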
