Solis Armando D
Biological Sciences Department, New York City College of Technology, the City University of New York (CUNY), Brooklyn, New York, 11201.
Proteins. 2015 Dec;83(12):2198-216. doi: 10.1002/prot.24936.
To reduce complexity, understand generalized rules of protein folding, and facilitate de novo protein design, the 20-letter amino acid alphabet is commonly reduced to a smaller alphabet by clustering amino acids based on some measure of similarity. In this work, we seek the optimal alphabet that preserves as much as possible of the structural information found in long-range (contact) interactions among amino acids in natively folded proteins. We employ the Information Maximization Device, based on information theory, to partition the amino acids into well-defined clusters. These optimal clusterings, which range from 2 to 19 groups, are generated automatically yet embody well-known properties of amino acids such as hydrophobicity/polarity, charge, size, and aromaticity, and are demonstrated to maintain the discriminative power of long-range interactions with minimal loss of mutual information. Our measurements suggest that reduced alphabets (of fewer than 10 letters) are able to capture virtually all of the information residing in native contacts and may be sufficient for fold recognition, as demonstrated by extensive threading tests. In an expansive survey of the literature, we observe that alphabets derived from various approaches (including those derived from physicochemical intuition, local structure considerations, and sequence alignments of remote homologs) fare consistently well in preserving contact interaction information, highlighting a convergence in the various factors thought to be relevant to the folding code. Moreover, we find that alphabets commonly used in experimental protein design are nearly optimal and are largely consistent with the observations that have arisen in this work.
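The idea of scoring a reduced alphabet by how much contact information it retains can be illustrated with a minimal sketch. This is not the Information Maximization Device itself, whose details the abstract does not give. It only computes the mutual information between the two residue types of a contact pair, before and after merging letters into clusters. The four-letter alphabet, the contact counts, and the function names `mutual_information` and `reduce_counts` are all hypothetical and chosen purely for illustration.

```python
from math import log2


def mutual_information(counts):
    """Mutual information (in bits) between the two residue classes of a contact pair.

    `counts[i][j]` is the number of observed contacts between class i and class j.
    """
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:  # empty cells contribute nothing
                p = n / total
                mi += p * log2(p / (row_p[i] * col_p[j]))
    return mi


def reduce_counts(counts, letters, groups):
    """Merge rows and columns of a contact-count table according to a clustering."""
    index = {aa: g for g, cluster in enumerate(groups) for aa in cluster}
    k = len(groups)
    reduced = [[0] * k for _ in range(k)]
    for i, a in enumerate(letters):
        for j, b in enumerate(letters):
            reduced[index[a]][index[b]] += counts[i][j]
    return reduced


# Made-up contact counts for a toy 4-letter alphabet (NOT real data):
# two hydrophobic residues (L, V) and two charged residues (K, E).
letters = "LVKE"
counts = [
    [40, 35, 5, 5],
    [35, 30, 5, 5],
    [5, 5, 5, 30],
    [5, 5, 30, 5],
]

full = mutual_information(counts)
hydrophobic_polar = mutual_information(reduce_counts(counts, letters, ["LV", "KE"]))
mixed = mutual_information(reduce_counts(counts, letters, ["LK", "VE"]))

print(f"full alphabet:      {full:.4f} bits")
print(f"{{L,V}} vs {{K,E}}:    {hydrophobic_polar:.4f} bits")
print(f"{{L,K}} vs {{V,E}}:    {mixed:.4f} bits")
```

In this toy table, the clustering that groups the hydrophobic residues together keeps more of the contact information than the one that mixes hydrophobic and charged residues. Merging letters can never increase the mutual information, so any search for an optimal alphabet of a given size amounts to finding the clustering with the smallest loss.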