Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments.

Author information

Department of Computer Science, University of California, Davis, CA 95211, USA.

Institut de Physique Théorique, CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France.

Publication information

Molecules. 2018 Dec 28;24(1):104. doi: 10.3390/molecules24010104.

Abstract

Residues that are in close spatial proximity in a protein are more prone to covary, as their interactions are likely to be preserved under structural and evolutionary constraints. If such covariation can be detected and quantified, physical contacts in the structure of a protein may be predicted solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on the biochemical and biophysical properties of amino acids, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding in which each amino acid is represented by a vector with a one along its own dimension and zeros elsewhere. Representations based on substitution matrices reach their optimum with 10 to 12 principal components, and even that optimum is lower than the performance obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
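To make the 20-dimensional binary encoding and the role of the prior concrete, the following is a minimal Python sketch and not the authors' implementation. It encodes an alignment as one-hot vectors, estimates a covariance matrix, mixes it with a prior so that it can be inverted, and scores each pair of columns from the corresponding block of the inverse (precision) matrix. The function names, the `prior_weight` parameter, the scaled-identity prior, the Frobenius-norm score, and the handling of gaps as all-zero vectors are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

# The 20 standard amino acids; any other symbol (e.g. a gap) is encoded
# as an all-zero vector. This gap handling is an assumption of the sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(msa):
    """Encode an alignment (list of equal-length strings) as an (N, L*20) array."""
    index = {aa: k for k, aa in enumerate(ALPHABET)}
    n_seq, length = len(msa), len(msa[0])
    x = np.zeros((n_seq, length, len(ALPHABET)))
    for n, seq in enumerate(msa):
        for i, aa in enumerate(seq):
            if aa in index:
                x[n, i, index[aa]] = 1.0
    return x.reshape(n_seq, length * len(ALPHABET))


def coupling_scores(msa, prior_weight=0.5):
    """Return an (L, L) matrix of pairwise coupling scores between columns."""
    x = one_hot_encode(msa)
    q = len(ALPHABET)
    dim = x.shape[1]
    length = dim // q

    # Empirical covariance of the encoded alignment (columns are variables).
    emp_cov = np.cov(x, rowvar=False, bias=True)

    # Hypothetical prior: a scaled identity. Mixing it in makes the
    # covariance full rank, so that it can be inverted.
    prior_cov = np.eye(dim) / q
    cov = (1.0 - prior_weight) * emp_cov + prior_weight * prior_cov

    # In a Gaussian model, direct couplings are read off the precision matrix.
    precision = np.linalg.inv(cov)

    # Entry (i*q + a, j*q + b) becomes blocks[i, a, j, b]; the score of a
    # column pair is the Frobenius norm of its q-by-q block.
    blocks = precision.reshape(length, q, length, q)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores
```

In this toy setting, `coupling_scores(["AAC", "AAD", "CCC", "CCD"])` gives a larger score to columns 0 and 1, which vary together, than to columns 0 and 2, which vary independently. Swapping `one_hot_encode` for a lower-dimensional encoding (for example, a single physicochemical value per residue) would change only the block size `q`, which is the kind of comparison the abstract describes.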

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d44/6337344/ac6c7d751b6f/molecules-24-00104-g001.jpg
