Neuwald Andrew F, Altschul Stephen F
Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, BioPark II, Room 617, Baltimore, MD, United States of America.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America.
PLoS Comput Biol. 2016 Dec 21;12(12):e1005294. doi: 10.1371/journal.pcbi.1005294. eCollection 2016 Dec.
Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations, one may hope to obtain information regarding functionally relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high-dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this approach reveals sequence and structural features indicative of functionally important, yet generally unknown, biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, the approach can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available at psed.igs.umaryland.edu.
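The statement that subgroup divergence appears as correlations between residue positions can be illustrated with a small sketch. It is not the hiHMM or MCMC procedure described above. It only computes the mutual information between two alignment columns, a simple and commonly used correlation measure that is swapped in here for illustration. The toy alignment, the subgroup names, and the `mutual_information` helper are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(seqs, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(seqs)
    col_i = Counter(s[i] for s in seqs)          # residue counts in column i
    col_j = Counter(s[j] for s in seqs)          # residue counts in column j
    pairs = Counter((s[i], s[j]) for s in seqs)  # joint residue-pair counts
    return sum(
        (c / n) * log2((c / n) / ((col_i[a] / n) * (col_j[b] / n)))
        for (a, b), c in pairs.items()
    )


# Hypothetical toy alignment (an assumption, not data from the paper):
# columns 0 and 2 are conserved across the whole superfamily, while
# columns 1 and 3 carry a residue pattern distinct to each subgroup.
subgroup_a = ["GRAE"] * 4
subgroup_b = ["GKAD"] * 4
superfamily = subgroup_a + subgroup_b

mi_pooled = mutual_information(superfamily, 1, 3)     # pattern columns, pooled
mi_within_a = mutual_information(subgroup_a, 1, 3)    # same columns, one subgroup
mi_conserved = mutual_information(superfamily, 0, 2)  # conserved columns

print(mi_pooled, mi_within_a, mi_conserved)
```

In this toy case the two pattern columns share one full bit of information when the subgroups are pooled and none within a single subgroup, so the correlation comes entirely from the subgroup structure. Pairwise measures of this kind capture only part of what a hierarchical model over a whole superfamily is meant to describe.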