Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families.

Authors

Barrat-Charlaix Pierre, Muntoni Anna Paola, Shimagaki Kai, Weigt Martin, Zamponi Francesco

Affiliations

Biozentrum, Universität Basel, Switzerland, Swiss Institute of Bioinformatics, Basel 4056, Switzerland.

Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy.

Publication information

Phys Rev E. 2021 Aug;104(2-1):024407. doi: 10.1103/PhysRevE.104.024407.

Abstract

Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and generating new functional sequences. However, the resulting PM suffers from important overfitting effects: many couplings are small, noisy, and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than 90% of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
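To make the ingredients named in the abstract concrete, the following is a minimal sketch of a pairwise Potts energy with local fields and two-site couplings, followed by one decimation round. It is not the authors' procedure: the abstract describes an information-based selection criterion, whereas this sketch ranks couplings by the Frobenius norm of each site-pair block as a simple stand-in, and it does not refit the remaining parameters after pruning. The function names, the dictionary layout, and the toy values are all assumptions made for illustration.

```python
import math


def potts_energy(seq, fields, couplings):
    """E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j) for one sequence."""
    energy = -sum(fields[i][a] for i, a in enumerate(seq))
    for (i, j), block in couplings.items():
        energy -= block[seq[i]][seq[j]]
    return energy


def coupling_strength(block):
    """Frobenius norm of one q x q coupling block (stand-in criterion)."""
    return math.sqrt(sum(x * x for row in block for x in row))


def decimate(couplings, fraction):
    """Return a copy without the weakest `fraction` of coupling blocks."""
    n_remove = int(len(couplings) * fraction)
    ranked = sorted(couplings, key=lambda pair: coupling_strength(couplings[pair]))
    dropped = set(ranked[:n_remove])
    return {pair: block for pair, block in couplings.items() if pair not in dropped}


# Toy model (hypothetical values): 3 sites, 2 states per site.
fields = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.0]]
couplings = {
    (0, 1): [[1.0, 0.0], [0.0, 1.0]],
    (0, 2): [[0.01, 0.0], [0.0, 0.02]],  # weakest block, pruned first
    (1, 2): [[0.3, 0.0], [0.0, 0.3]],
}

# One round removing half of the blocks, rounded down to a single block.
sparse = decimate(couplings, 0.5)
print(sorted(sparse))
print(potts_energy((1, 1, 0), fields, couplings))
print(potts_energy((1, 1, 0), fields, sparse))
```

Calling `decimate` repeatedly on its own output gives an iterative schedule in the spirit of the abstract. A faithful implementation would replace `coupling_strength` with the paper's information-based score and would presumably retrain the model between rounds.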
